C-CoMP-STC / GEM-mit1002

Creative Commons Attribution 4.0 International
0 stars 0 forks source link

Use OMEGGA to make model #59

Open hgscott opened 7 months ago

hgscott commented 7 months ago

Ilija shared a narrative using the new OMEGGA app for annotating.making a model.

hgscott commented 2 months ago

My current narrative for making the Amac model is: "MIT1002 RAST MS2 beta"

hgscott commented 2 months ago

Things I need to figure out:

hgscott commented 2 months ago

What is the correct genome?

hgscott commented 2 months ago

Uploading that file to KBase:

I think I just need a GFF file, I will ask Rogier/student/Sam. (Update: Asked all three of them via slack on 2024-09-18 @ 10:30 AM)

hgscott commented 2 months ago

In the files that Michelle sent to me (via slack) on 2024-04-23, there is a txt file that I think I can convert to a gff file.

The example GFF file with the wrong gene calls: Image

The TXT file I have with the correct gene calls: Image

The GFF file format, according to wikipedia:

Position index Position name Description
1 seqid The name of the sequence where the feature is located.
2 source The algorithm or procedure that generated the feature. This is typically the name of a software or database.
3 type The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4 start Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED.
5 end Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.[citation needed]
6 score Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7 strand Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8 phase phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). See the section below for a detailed explanation.
9 attributes A list of tag-value pairs separated by a semicolon with additional information about the feature.

So to convert to the GFF I need:

hgscott commented 2 months ago

I manually edited/made my new GFF file in excel, I saved the excel workbook in the genomes folder (not tracked on GitHub).

hgscott commented 2 months ago

Loading my new GFF file into KBase:

Got an error: Image

hgscott commented 2 months ago

I tried removing the "2738...67___" from the ID in the attributes field. So the GFF looks like this: Image

Got this error: Image

hgscott commented 2 months ago

I found this website (https://genometools.org/cgi-bin/gff3validator.cgi) that checks if a GFF file is valid, that might help diagnose the problem with my file.

hgscott commented 2 months ago

When I run it on my current file, I get this message: image

hgscott commented 2 months ago

To chage the extension I navidated to that folder in termainal and ran: mv 2738541267_genecalls.gff.txt 2738541267_genecalls.gff to remove the .txt extension.

hgscott commented 2 months ago

Now, I get the following error: image

hgscott commented 2 months ago

Here's an example I found online of a GFF file with a ## sequence region line:

image
hgscott commented 2 months ago

So I added in my own ##sequence-region lines myself, I added:

##sequence-region c_000000000001 1 4633392
##sequence-region c_000000000002 1 102343

I wasn't actually sure the length of these regions, so I just took the largest number at the end of the list for each.

So now my file looks like: image

hgscott commented 2 months ago

Now that is passing the validation checks! image

hgscott commented 2 months ago

But it still gave me the same error: image

hgscott commented 2 months ago

Try finding an example GFF & fasta file, just to see if the app works.

I tried running it with Zac's GFF file (MIT1002_gene_calls_20231115.gff) and the whole genome fasta (nucleotide) file (2738541267.fa), and that worked. image image

hgscott commented 2 months ago

I tried running my validated GFF and the protein seq fasta file again, and got a slightly different error message: image

I think this means that in the protein sequence fa file the numbers after the > are the 0...4106, NOT c_00..001, which is the sequence id in the GFF file.

Can I run with my handmade GFF and the full nucleotide genome like I did with Zac's?

hgscott commented 2 months ago

I set up the app to use my new gene calls GFF, and the full genome: image

And it worked! image And I can see that there are only 4106 features, not 4116 features.

hgscott commented 2 months ago

Now that I have the genome loading into KBase correctly, here is the outline for the rest of my narrative:

  1. Load genome
  2. Annotate genome with RAST
  3. Build draft model with omegga MAYBE: Add things Michelle/pangenome identified that RAST did not?
  4. Test the producibility of each biomass component on each carbon source
  5. Modify the biomass reaction, if needed
  6. Gap filling
    1. Define/load the media
    2. Loop to do
      1. Sequential gap filling in different orders
      2. Independent gap filling
      3. Global gap filling
  7. Growth phenotypes for all models (compare to experimental)
  8. Problem diagnosis
    1. Look for energy-generating cycles. Can I run MACAW in KBase, or can I rewrite a simpler version in a code block?
hgscott commented 2 months ago

Daniel says: I think the order looks good. Only thing I can think of is that it may be good to get rid of energy-generating cycles right away, as it may affect gap filling etc.

hgscott commented 1 month ago

I asked ChatGPT how to use a code block to replace the "Edit Metabolic Model" app, and this is the code it generated for me:

from biokbase.narrative.jobs.appmanager import AppManager

app = AppManager()

params = {
    'fbamodel_id': 'My_Model_ID',
    'fbamodel_workspace': 'My_Model_WS',
    'compounds_to_add': [
        {
            'id': 'cpd_new',
            'name': 'New Compound',
            'formula': 'C6H12O6',
            'charge': 0,
            'aliases': []
        }
    ],
    'reactions_to_add': [
        {
            'id': 'rxn_new',
            'name': 'New Reaction',
            'reaction_string': 'cpdA + cpdB => cpdC + cpd_new',
            'direction': '=',
            'gpr': '(gene1 and gene2)',
            'pathway': 'Custom Pathway',
            'reference': '',
            'enzyme': '',
            'equation': '',
            'aliases': []
        }
    ],
    'compounds_to_change': [
        {
            'compound_id': 'cpd_to_modify',
            'attribute': 'name',
            'new_value': 'Modified Compound Name'
        }
    ],
    'reactions_to_change': [],
    'compounds_to_remove': ['cpd_to_remove'],
    'reactions_to_remove': ['rxn_to_remove'],
    'biomasses_to_add': [],
    'biomasses_to_change': [],
    'biomasses_to_remove': [],
    'edit_compound_stoichiometry': [],
    'new_fbamodel_id': 'My_Edited_Model_ID',
    'workspace': 'My_Output_WS'
}

job = app.run_app('fba_tools/edit_metabolic_model', params)

# Monitor the job
import time

while job.is_running():
    print("Job is running...")
    time.sleep(10)

print("Job completed!")
output = job.result()
print(output)