Open hgscott opened 7 months ago
Michelle says: The file format that anvio wants is a ‘functions-txt’, as they call it: https://anvio.org/help/7/artifacts/functions-txt/
Here's an example of that file:
I downloaded the RASTtk annotation results as a JSON file:
And I don't know how to extract the information anvio needs from this. Here's one example entry:
{"cdss":["2738541267___1448"],
"dna_sequence":"GTGCCAGATATGAAGCTCTTTGCAGGTAATGCCGTACCAGAACTTGCCCAGAAAGTTGCCGATCGCCTCTACACCAAACTTGGAAATGCCAAAGTTGGCCGTTTCAGTGACGGTGAAATCAGCGTAGAAATTCATGAAAACGTCCGTGGCTCGGACGTTTTTATTATCCAGTCTACGTGTGCGCCTACTAACGATAACCTTATGGAACTTATTGTGATGATCGACGCACTACGTCGCGCATCAGCTGGTCGTATTACAGCAGTAATTCCTTACTTTGGTTATGCACGCCAAGACCGTCGTGTTCGTTCAGCTAGGGTGCCTATTACTGCGAAAGTAGTGGCTGACTTCCTGTCTAACGTTGGTGTTGACCGCGTACTTACTATCGACCTACACGCCGAACAAATTCAAGGTTTCTTTGATGTTCCGGTGGATAACGCATTCGGTACTCCTATCCTTCTTGCTGACATGGTAAGACGTGATTTTGCCGACCCTGTAGTCGTTTCTCCTGACATTGGCGGTGTTGTACGTGCACGTGCTACTGCGAAACTACTTAACGACACCGACCTTGCCATTATCGATAAGCGTCGCCCTAAAGCGAACGTGGCTCAGGTAATGAACATCATTGGTGACGTAAAAGACAGAGACTGCATCATTGTCGATGACATGATTGATACAGGCGGTACGCTTGCAAAAGCAGCTGAAGCACTTAAAGCACATGGTGCGCGTCGTGTTTATGCTTACGCAACTCACGCTATCTTTTCAGGTAACGCTGCAAACAATCTTAAAGAGTCTGTTATTGACGAAATTATCGTTACCGACTCTATCCCATTAAGTGCAGAGATGAAGCAAATTGGAAAAGTAAAACAGCTTACTTTATCTGAGATGCTTGCAGAAACTATTCGTCGCATCAGCAACGAAGAGTCTATTTCAGCAATGTTTGAATACTAA",
"dna_sequence_length":948,
"functions":["Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)"],
"id":"2738541267___1448_gene",
"location":[["c_000000000001",1661235,"-",948]],
"md5":"778f5e2ac9be5f272cee4afbd76ce0a4",
"ontology_terms":{"SSO":{"SSO:000007113":[2]}},
"protein_md5":"778f5e2ac9be5f272cee4afbd76ce0a4",
"protein_translation":"MPDMKLFAGNAVPELAQKVADRLYTKLGNAKVGRFSDGEISVEIHENVRGSDVFIIQSTCAPTNDNLMELIVMIDALRRASAGRITAVIPYFGYARQDRRVRSARVPITAKVVADFLSNVGVDRVLTIDLHAEQIQGFFDVPVDNAFGTPILLADMVRRDFADPVVVSPDIGGVVRARATAKLLNDTDLAIIDKRRPKANVAQVMNIIGDVKDRDCIIVDDMIDTGGTLAKAAEALKAHGARRVYAYATHAIFSGNAANNLKESVIDEIIVTDSIPLSAEMKQIGKVKQLTLSEMLAETIRRISNEESISAMFEY",
"protein_translation_length":315,
"quality":{"hit_count":275,
"weighted_hit_count":821.610657},
"warnings":["This gene was not in the source GenBank or GFF file. It was added to be the parent of a CDS."]}
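For what it's worth, here is a minimal sketch of pulling the relevant fields out of one entry with Python's standard `json` module. It assumes every entry is shaped like the example above (the long sequence fields are left out here for brevity), and `summarize_feature` is a hypothetical helper name, not something from RASTtk or KBase.

```python
import json

# One feature entry shaped like the example above; the long sequence
# fields are omitted to keep the sketch short.
ENTRY_JSON = """
{"cdss": ["2738541267___1448"],
 "dna_sequence_length": 948,
 "functions": ["Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)"],
 "id": "2738541267___1448_gene",
 "location": [["c_000000000001", 1661235, "-", 948]]}
"""


def summarize_feature(feature):
    """Collect the IDs, functions, and first location of one feature.

    Hypothetical helper: assumes "location" holds at least one
    [contig, start, strand, length] entry, as in the example above.
    """
    contig, start, strand, length = feature["location"][0]
    cds_ids = feature.get("cdss", [])
    return {
        "gene_id": feature["id"],
        "cds_id": cds_ids[0] if cds_ids else None,
        "functions": feature.get("functions", []),
        "contig": contig,
        "start": start,
        "strand": strand,
        "length": length,
    }


summary = summarize_feature(json.loads(ENTRY_JSON))
print(summary)
```

The same function could be applied to every entry in the downloaded file once the list of features has been located in the JSON; where that list lives in the full file is not shown here.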
I looked at using the model object directly:
But the gene ID isn't tied to any actual sequence data:
Even in the full model file, the genes aren't tied to any sequence information:
I think I could cobble together a file where `source` is "RASTtk", the `accession` is the cdss ID at the start of the entry, the `function` is "functions", and the `e_value` is arbitrary. And Michelle can get me the `gene_callers_id` by using `anvi-get-sequences-for-gene-calls`.
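A rough sketch of that cobbling step, under some loud assumptions: `write_functions_txt` is a hypothetical helper, the `gene_callers_ids` mapping (cdss ID to anvio gene caller ID) is something we would still have to build ourselves from Michelle's output, the ID `1448` below is made up for illustration, and the `e_value` of `0` is just an arbitrary placeholder. The five column names are the ones listed on the functions-txt help page.

```python
import csv
import io


def write_functions_txt(features, gene_callers_ids, handle, source="RASTtk"):
    """Write a tab-delimited functions-txt table and return the row count.

    Hypothetical helper. `gene_callers_ids` maps a cdss ID to an anvio
    gene_callers_id; features whose cdss ID is missing from it are skipped.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_callers_id", "source", "accession", "function", "e_value"])
    written = 0
    for feature in features:
        cds_ids = feature.get("cdss", [])
        if not cds_ids or cds_ids[0] not in gene_callers_ids:
            continue
        accession = cds_ids[0]
        # One row per function string; e_value is an arbitrary placeholder.
        for function in feature.get("functions", []):
            writer.writerow([gene_callers_ids[accession], source, accession, function, 0])
            written += 1
    return written


# Example with the single entry from this issue and a made-up ID mapping.
features = [{
    "cdss": ["2738541267___1448"],
    "functions": ["Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)"],
    "id": "2738541267___1448_gene",
}]
gene_callers_ids = {"2738541267___1448": 1448}  # hypothetical mapping

buffer = io.StringIO()
row_count = write_functions_txt(features, gene_callers_ids, buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

For a real run, `handle` would be an open file (for example `open("functions.txt", "w", newline="")`) instead of the in-memory buffer used here.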
But my problem with this is that it doesn't tell me the ModelSEED reaction ID (e.g. "rxn00015"), which is what I really want.
I started doing a deep dive on the KBase code for building the models to figure out how KBase is going from the RAST annotation to the ModelSEED IDs, and I think that is best tackled in a separate issue.
Osnat says that there is an option to retain the original annotations (during the re-annotation step).
We want to be able to load the RASTtk annotations into anvio to compare the annotations used in the model with what Michelle got from the pangenome.