Open hgscott opened 7 months ago
My current narrative for making the Amac model is: "MIT1002 RAST MS2 beta"
Things I need to figure out:
What is the correct genome?
make_model.py
is "MIT1002_anvio_prot_seqs.fa", saved in the genome folder (not pushed because of size)Uploading that file to KBase:
I think I just need a GFF file, I will ask Rogier/student/Sam. (Update: Asked all three of them via slack on 2024-09-18 @ 10:30 AM)
In the files that Michelle sent to me (via slack) on 2024-04-23, there is a txt file that I think I can convert to a gff file.
The example GFF file with the wrong gene calls:
The TXT file I have with the correct gene calls:
The GFF file format, according to wikipedia:
Position index | Position name | Description |
---|---|---|
1 | seqid | The name of the sequence where the feature is located. |
2 | source | The algorithm or procedure that generated the feature. This is typically the name of a software or database. |
3 | type | The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project. |
4 | start | Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED. |
5 | end | Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.[citation needed] |
6 | score | Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value. |
7 | strand | Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands. |
8 | phase | phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). See the section below for a detailed explanation. |
9 | attributes | A list of tag-value pairs separated by a semicolon with additional information about the feature. |
So to convert to the GFF I need:
I manually edited/made my new GFF file in excel, I saved the excel workbook in the genomes folder (not tracked on GitHub).
Loading my new GFF file into KBase:
Got an error:
I tried removing the "2738...67___" from the ID in the attributes field. So the GFF looks like this:
Got this error:
I found this website (https://genometools.org/cgi-bin/gff3validator.cgi) that checks if a GFF file is valid, that might help diagnose the problem with my file.
When I run it on my current file, I get this message:
To chage the extension I navidated to that folder in termainal and ran:
mv 2738541267_genecalls.gff.txt 2738541267_genecalls.gff
to remove the .txt extension.
Now, I get the following error:
Here's an example I found online of a GFF file with a ## sequence region
line:
So I added in my own ##sequence-region
lines myself, I added:
##sequence-region c_000000000001 1 4633392
##sequence-region c_000000000002 1 102343
I wasn't actually sure the length of these regions, so I just took the largest number at the end of the list for each.
So now my file looks like:
Now that is passing the validation checks!
But it still gave me the same error:
Try finding an example GFF & fasta file, just to see if the app works.
I tried running it with Zac's GFF file (MIT1002_gene_calls_20231115.gff) and the whole genome fasta (nucleotide) file (2738541267.fa), and that worked.
I tried running my validated GFF and the protein seq fasta file again, and got a slightly different error message:
I think this means that in the protein sequence fa file the numbers after the >
are the 0...4106, NOT c_00..001, which is the sequence id in the GFF file.
Can I run with my handmade GFF and the full nucleotide genome like I did with Zac's?
I set up the app to use my new gene calls GFF, and the full genome:
And it worked! And I can see that there are only 4106 features, not 4116 features.
Now that I have the genome loading into KBase correctly, here is the outline for the rest of my narrative:
Daniel says: I think the order looks good. Only thing I can think of is that it may be good to get rid of energy-generating cycles right away, as it may affect gap filling etc.
I asked ChatGPT how to use a code block to replace the "Edit Metabolic Model" app, and this is the code it generated for me:
from biokbase.narrative.jobs.appmanager import AppManager
app = AppManager()
params = {
'fbamodel_id': 'My_Model_ID',
'fbamodel_workspace': 'My_Model_WS',
'compounds_to_add': [
{
'id': 'cpd_new',
'name': 'New Compound',
'formula': 'C6H12O6',
'charge': 0,
'aliases': []
}
],
'reactions_to_add': [
{
'id': 'rxn_new',
'name': 'New Reaction',
'reaction_string': 'cpdA + cpdB => cpdC + cpd_new',
'direction': '=',
'gpr': '(gene1 and gene2)',
'pathway': 'Custom Pathway',
'reference': '',
'enzyme': '',
'equation': '',
'aliases': []
}
],
'compounds_to_change': [
{
'compound_id': 'cpd_to_modify',
'attribute': 'name',
'new_value': 'Modified Compound Name'
}
],
'reactions_to_change': [],
'compounds_to_remove': ['cpd_to_remove'],
'reactions_to_remove': ['rxn_to_remove'],
'biomasses_to_add': [],
'biomasses_to_change': [],
'biomasses_to_remove': [],
'edit_compound_stoichiometry': [],
'new_fbamodel_id': 'My_Edited_Model_ID',
'workspace': 'My_Output_WS'
}
job = app.run_app('fba_tools/edit_metabolic_model', params)
# Monitor the job
import time
while job.is_running():
print("Job is running...")
time.sleep(10)
print("Job completed!")
output = job.result()
print(output)
Ilija shared a narrative using the new OMEGGA app for annotating.making a model.