Use OMEGGA to make model

hgscott commented 7 months ago

Ilija shared a narrative using the new OMEGGA app for annotating/making a model.

hgscott commented 2 months ago

My current narrative for making the Amac model is: "MIT1002 RAST MS2 beta"

I was using the MI11002_gff_genome, because the anvio file failed
I built the model using MS2
I manually edited the model to fix the energy generating cycle (removed one reaction)
I attempted to gapfill on glucose, acetate, and 3-hydroxybutyrate, but the model could not grow on any of them

hgscott commented 2 months ago

Things I need to figure out:

[x] Loading the correct genome into KBase (#61)
[ ] Defining the media
[ ] Make the gapfilling work (#55)
[ ] Checking that gapfilling does not cause growth on everything (#20)

hgscott commented 2 months ago

What is the correct genome?

The file I used in make_model.py is "MIT1002_anvio_prot_seqs.fa", saved in the genome folder (not pushed because of size)

hgscott commented 2 months ago

Uploading that file to KBase:

Can't upload as a GFF+FASTA genome, because we don't have a GFF (MIT1002_gene_calls_20231115.gff has the wrong number of gene calls, 4116 instead of 4106, so it is the old version from Zac)
Tried importing as a "FASTA assembly"
- Failed because that is only for nucleotide files

I think I just need a GFF file, so I will ask Rogier/student/Sam. (Update: Asked all three of them via Slack on 2024-09-18 @ 10:30 AM)

hgscott commented 2 months ago

In the files that Michelle sent to me (via Slack) on 2024-04-23, there is a TXT file that I think I can convert to a GFF file.

The example GFF file with the wrong gene calls:

The TXT file I have with the correct gene calls:

The GFF file format, according to wikipedia:

Position index	Position name	Description
1	seqid	The name of the sequence where the feature is located.
2	source	The algorithm or procedure that generated the feature. This is typically the name of a software or database.
3	type	The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4	start	Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED.
5	end	Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.
6	score	Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7	strand	Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8	phase	Phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else).
9	attributes	A list of tag-value pairs separated by a semicolon with additional information about the feature.

So to convert to the GFF I need:

seqid: "contig" in TXT file
source: "prodigal"; not sure if the "version" column in the text file refers to prodigal or not (might be for anvio itself)
type: call_type in the text file is all "1", I assume that means CDS
start: start column from text file, but I think I need to add 1 to it
end: stop column from text file, do NOT need to add 1 to it
score: can just use "."
strand: Not the same as the direction column, so I think I will start by using "." for all of them
phase: Was "." in the old one, so just do that
attributes: Fill in ID=gene_callers_id from TXT file (Do I need the organism number 273...?)
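A minimal sketch of how the mapping above could be scripted rather than done by hand. It assumes the TXT file is tab-separated with a header row containing the column names mentioned above (gene_callers_id, contig, start, stop, call_type); the function name and the two example rows are hypothetical and are not real MIT1002 data.

```python
import csv
import io


def gene_calls_to_gff(txt_handle, source="prodigal"):
    """Convert tab-separated gene calls into GFF3 lines using the mapping above."""
    lines = ["##gff-version 3"]
    for row in csv.DictReader(txt_handle, delimiter="\t"):
        fields = [
            row["contig"],                   # seqid
            source,                          # source (assumed to be prodigal)
            "CDS",                           # type (call_type 1 assumed to mean CDS)
            str(int(row["start"]) + 1),      # start: 0-based -> 1-based
            row["stop"],                     # end: used unchanged
            ".",                             # score
            ".",                             # strand left undetermined for now
            ".",                             # phase
            "ID=" + row["gene_callers_id"],  # attributes
        ]
        lines.append("\t".join(fields))
    return lines


# Hypothetical two-gene input, only to show the shape of the data
example = (
    "gene_callers_id\tcontig\tstart\tstop\tdirection\tcall_type\tversion\n"
    "0\tc_000000000001\t0\t300\tf\t1\tv2.6.3\n"
    "1\tc_000000000001\t450\t900\tr\t1\tv2.6.3\n"
)
gff_lines = gene_calls_to_gff(io.StringIO(example))
print("\n".join(gff_lines))
```

For the real file, the `io.StringIO(example)` argument would be replaced by an open file handle, and the returned lines written out to a .gff file.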

hgscott commented 2 months ago

I manually edited/made my new GFF file in Excel, and saved the Excel workbook in the genomes folder (not tracked on GitHub).

hgscott commented 2 months ago

Loading my new GFF file into KBase:

I put the "Genome type" as "Finished isolate", but I don't think that really affects anything
MIT1002 didn't come up in the "Scientific name" search, and it doesn't look like you can type in your own there, so I just left it blank
Left the source as "other"

Got an error:

hgscott commented 2 months ago

I tried removing the "2738...67___" from the ID in the attributes field. So the GFF looks like this:

Got this error:

hgscott commented 2 months ago

I found this website (https://genometools.org/cgi-bin/gff3validator.cgi) that checks if a GFF file is valid, that might help diagnose the problem with my file.

hgscott commented 2 months ago

When I run it on my current file, I get this message:

hgscott commented 2 months ago

To remove the .txt extension, I navigated to that folder in the terminal and ran: mv 2738541267_genecalls.gff.txt 2738541267_genecalls.gff

hgscott commented 2 months ago

Now, I get the following error:

hgscott commented 2 months ago

Here's an example I found online of a GFF file with a ##sequence-region line:

hgscott commented 2 months ago

So I added my own ##sequence-region lines:

##sequence-region c_000000000001 1 4633392
##sequence-region c_000000000002 1 102343

I wasn't actually sure of the length of these regions, so I just took the largest number at the end of the gene list for each contig.
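A small sketch of the same idea in code: derive one ##sequence-region line per contig from the largest end coordinate among its features. The function name and the example features are hypothetical; the input is assumed to be the tab-separated feature lines of the GFF.

```python
def sequence_region_lines(gff_lines):
    """Build ##sequence-region lines from the largest end coordinate per seqid."""
    max_end = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, pragmas, and comments
        cols = line.rstrip("\n").split("\t")
        seqid, end = cols[0], int(cols[4])
        max_end[seqid] = max(end, max_end.get(seqid, 0))
    return [f"##sequence-region {seqid} 1 {end}" for seqid, end in max_end.items()]


# Hypothetical features on two contigs, not real MIT1002 data
features = [
    "c_000000000001\tprodigal\tCDS\t1\t300\t.\t.\t.\tID=0",
    "c_000000000001\tprodigal\tCDS\t451\t900\t.\t.\t.\tID=1",
    "c_000000000002\tprodigal\tCDS\t10\t250\t.\t.\t.\tID=2",
]
pragmas = sequence_region_lines(features)
print("\n".join(pragmas))
```

The largest feature end is only a lower bound on the contig length; the exact length would come from the nucleotide FASTA for that contig.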

So now my file looks like:

hgscott commented 2 months ago

Now that is passing the validation checks!

hgscott commented 2 months ago

But it still gave me the same error:

hgscott commented 2 months ago

Try finding an example GFF & fasta file, just to see if the app works.

I tried running it with Zac's GFF file (MIT1002_gene_calls_20231115.gff) and the whole genome fasta (nucleotide) file (2738541267.fa), and that worked.

hgscott commented 2 months ago

I tried running my validated GFF and the protein seq fasta file again, and got a slightly different error message:

I think this means that in the protein sequence .fa file, the IDs after the > are the numbers 0...4106, NOT c_00..001, which is the sequence ID in the GFF file.
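One way to confirm this kind of mismatch before uploading is to compare the FASTA header IDs with the seqids in the GFF. This is a rough sketch with hypothetical function names and miniature made-up inputs; it assumes the record ID is the first word after ">" and that GFF columns are tab-separated.

```python
def fasta_ids(fasta_lines):
    """Collect record IDs: the first word after '>' on each header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def gff_seqids(gff_lines):
    """Collect the seqid (column 1) of every feature line."""
    return {
        line.split("\t")[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


# Hypothetical miniature inputs that mimic the two kinds of FASTA file
protein_fasta = [">0", "MKT", ">1", "MSA"]
nucleotide_fasta = [">c_000000000001", "ATGC"]
gff = ["##gff-version 3", "c_000000000001\tprodigal\tCDS\t1\t300\t.\t.\t.\tID=0"]

missing_in_protein = gff_seqids(gff) - fasta_ids(protein_fasta)
missing_in_nucleotide = gff_seqids(gff) - fasta_ids(nucleotide_fasta)
print("seqids not found in protein FASTA:", sorted(missing_in_protein))
print("seqids not found in nucleotide FASTA:", sorted(missing_in_nucleotide))
```

Any seqid reported as missing means the GFF refers to a sequence the FASTA does not contain, which is the situation described above for the protein file.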

Can I run with my handmade GFF and the full nucleotide genome like I did with Zac's?

hgscott commented 2 months ago

I set up the app to use my new gene calls GFF, and the full genome:

And it worked! And I can see that there are only 4106 features, not 4116 features.
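The feature count can also be checked locally before uploading. A minimal sketch, assuming a tab-separated GFF where column 3 is the feature type; the function name and the example lines are hypothetical.

```python
from collections import Counter


def count_features(gff_lines):
    """Count feature lines per type (column 3), ignoring pragmas and blanks."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[line.split("\t")[2]] += 1
    return counts


# Hypothetical three-feature file, not real MIT1002 data
gff = [
    "##gff-version 3",
    "##sequence-region c_000000000001 1 900",
    "c_000000000001\tprodigal\tCDS\t1\t300\t.\t.\t.\tID=0",
    "c_000000000001\tprodigal\tCDS\t451\t900\t.\t.\t.\tID=1",
    "c_000000000002\tprodigal\tCDS\t10\t250\t.\t.\t.\tID=2",
]
counts = count_features(gff)
print(dict(counts), "total:", sum(counts.values()))
```

Run on the real files, the total should distinguish the old GFF (4116 gene calls) from the new one (4106).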

hgscott commented 2 months ago

Now that I have the genome loading into KBase correctly, here is the outline for the rest of my narrative:

Load genome
Annotate genome with RAST
Build draft model with OMEGGA (MAYBE: Add things Michelle/pangenome identified that RAST did not?)
Test the producibility of each biomass component on each carbon source
Modify the biomass reaction, if needed
Gap filling
1. Define/load the media
2. Loop to do
  1. Sequential gap filling in different orders
  2. Independent gap filling
  3. Global gap filling
Growth phenotypes for all models (compare to experimental)
Problem diagnosis
1. Look for energy-generating cycles. Can I run MACAW in KBase, or can I rewrite a simpler version in a code block?
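The gap filling loop in the outline could be planned as a list of runs before any app is called. This is only a sketch of the bookkeeping: the media names are the three carbon sources tried earlier, and the meaning given to each strategy (sequential = gap fill on each medium in turn, independent = each medium on its own from the draft model, global = all media at once) is an assumed reading of the outline, not a KBase definition. The function name is hypothetical.

```python
from itertools import permutations

# Carbon sources from the earlier gap filling attempt
media = ["glucose", "acetate", "3-hydroxybutyrate"]


def gapfill_plans(media):
    """List every gap filling run implied by the outline (assumed interpretation)."""
    plans = []
    # Sequential: one run per ordering of the media
    for order in permutations(media):
        plans.append({"strategy": "sequential", "media": list(order)})
    # Independent: each medium gap filled separately
    for medium in media:
        plans.append({"strategy": "independent", "media": [medium]})
    # Global: all media considered together
    plans.append({"strategy": "global", "media": list(media)})
    return plans


plans = gapfill_plans(media)
for plan in plans:
    print(plan["strategy"], "->", ", ".join(plan["media"]))
```

Each entry could then drive one gap filling run, with the resulting models compared against the experimental growth phenotypes.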

hgscott commented 2 months ago

Daniel says: I think the order looks good. Only thing I can think of is that it may be good to get rid of energy-generating cycles right away, as it may affect gap filling etc.

hgscott commented 1 month ago

I asked ChatGPT how to use a code block to replace the "Edit Metabolic Model" app, and this is the code it generated for me:

from biokbase.narrative.jobs.appmanager import AppManager

app = AppManager()

params = {
    'fbamodel_id': 'My_Model_ID',
    'fbamodel_workspace': 'My_Model_WS',
    'compounds_to_add': [
        {
            'id': 'cpd_new',
            'name': 'New Compound',
            'formula': 'C6H12O6',
            'charge': 0,
            'aliases': []
        }
    ],
    'reactions_to_add': [
        {
            'id': 'rxn_new',
            'name': 'New Reaction',
            'reaction_string': 'cpdA + cpdB => cpdC + cpd_new',
            'direction': '=',
            'gpr': '(gene1 and gene2)',
            'pathway': 'Custom Pathway',
            'reference': '',
            'enzyme': '',
            'equation': '',
            'aliases': []
        }
    ],
    'compounds_to_change': [
        {
            'compound_id': 'cpd_to_modify',
            'attribute': 'name',
            'new_value': 'Modified Compound Name'
        }
    ],
    'reactions_to_change': [],
    'compounds_to_remove': ['cpd_to_remove'],
    'reactions_to_remove': ['rxn_to_remove'],
    'biomasses_to_add': [],
    'biomasses_to_change': [],
    'biomasses_to_remove': [],
    'edit_compound_stoichiometry': [],
    'new_fbamodel_id': 'My_Edited_Model_ID',
    'workspace': 'My_Output_WS'
}

job = app.run_app('fba_tools/edit_metabolic_model', params)

# Monitor the job
import time

while job.is_running():
    print("Job is running...")
    time.sleep(10)

print("Job completed!")
output = job.result()
print(output)

C-CoMP-STC / GEM-mit1002

Use OMEGGA to make model #59