C-CoMP-STC / GEM-mit1002

Creative Commons Attribution 4.0 International

Export the RASTtk annotations to Anvio #49

Open hgscott opened 7 months ago

hgscott commented 7 months ago

We want to be able to load the RASTtk annotations into anvio so we can compare the annotations used in the model with what Michelle got from the pangenome.

hgscott commented 7 months ago

Michelle says: The file format that anvio wants is a 'functions-txt', as they call it: https://anvio.org/help/7/artifacts/functions-txt/

hgscott commented 7 months ago

Here's an example of that file: Image

hgscott commented 7 months ago

I downloaded the RASTtk annotation results as a JSON file: Image

And I don't know how to extract the information anvio needs from this. Here's one example entry:

{"cdss":["2738541267___1448"],
"dna_sequence":"GTGCCAGATATGAAGCTCTTTGCAGGTAATGCCGTACCAGAACTTGCCCAGAAAGTTGCCGATCGCCTCTACACCAAACTTGGAAATGCCAAAGTTGGCCGTTTCAGTGACGGTGAAATCAGCGTAGAAATTCATGAAAACGTCCGTGGCTCGGACGTTTTTATTATCCAGTCTACGTGTGCGCCTACTAACGATAACCTTATGGAACTTATTGTGATGATCGACGCACTACGTCGCGCATCAGCTGGTCGTATTACAGCAGTAATTCCTTACTTTGGTTATGCACGCCAAGACCGTCGTGTTCGTTCAGCTAGGGTGCCTATTACTGCGAAAGTAGTGGCTGACTTCCTGTCTAACGTTGGTGTTGACCGCGTACTTACTATCGACCTACACGCCGAACAAATTCAAGGTTTCTTTGATGTTCCGGTGGATAACGCATTCGGTACTCCTATCCTTCTTGCTGACATGGTAAGACGTGATTTTGCCGACCCTGTAGTCGTTTCTCCTGACATTGGCGGTGTTGTACGTGCACGTGCTACTGCGAAACTACTTAACGACACCGACCTTGCCATTATCGATAAGCGTCGCCCTAAAGCGAACGTGGCTCAGGTAATGAACATCATTGGTGACGTAAAAGACAGAGACTGCATCATTGTCGATGACATGATTGATACAGGCGGTACGCTTGCAAAAGCAGCTGAAGCACTTAAAGCACATGGTGCGCGTCGTGTTTATGCTTACGCAACTCACGCTATCTTTTCAGGTAACGCTGCAAACAATCTTAAAGAGTCTGTTATTGACGAAATTATCGTTACCGACTCTATCCCATTAAGTGCAGAGATGAAGCAAATTGGAAAAGTAAAACAGCTTACTTTATCTGAGATGCTTGCAGAAACTATTCGTCGCATCAGCAACGAAGAGTCTATTTCAGCAATGTTTGAATACTAA",
"dna_sequence_length":948,
"functions":["Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)"],
"id":"2738541267___1448_gene",
"location":[["c_000000000001",1661235,"-",948]],
"md5":"778f5e2ac9be5f272cee4afbd76ce0a4",
"ontology_terms":{"SSO":{"SSO:000007113":[2]}},
"protein_md5":"778f5e2ac9be5f272cee4afbd76ce0a4",
"protein_translation":"MPDMKLFAGNAVPELAQKVADRLYTKLGNAKVGRFSDGEISVEIHENVRGSDVFIIQSTCAPTNDNLMELIVMIDALRRASAGRITAVIPYFGYARQDRRVRSARVPITAKVVADFLSNVGVDRVLTIDLHAEQIQGFFDVPVDNAFGTPILLADMVRRDFADPVVVSPDIGGVVRARATAKLLNDTDLAIIDKRRPKANVAQVMNIIGDVKDRDCIIVDDMIDTGGTLAKAAEALKAHGARRVYAYATHAIFSGNAANNLKESVIDEIIVTDSIPLSAEMKQIGKVKQLTLSEMLAETIRRISNEESISAMFEY",
"protein_translation_length":315,
"quality":{"hit_count":275,
                 "weighted_hit_count":821.610657},
"warnings":["This gene was not in the source GenBank or GFF file. It was added to be the parent of a CDS."]}
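
As a starting point for pulling out what anvio might need, here is a minimal Python sketch that reads one entry shaped like the example above and collects the ID, function, location, and SSO fields. The function name `summarize_feature` is hypothetical, the entry is abbreviated (sequence fields omitted), and how these entries are nested inside the full downloaded JSON file is not shown in this thread, so that part is left out.

```python
import json

# Abbreviated copy of the example entry above (sequence fields omitted).
ENTRY_JSON = """
{"cdss": ["2738541267___1448"],
 "functions": ["Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)"],
 "id": "2738541267___1448_gene",
 "location": [["c_000000000001", 1661235, "-", 948]],
 "ontology_terms": {"SSO": {"SSO:000007113": [2]}}}
"""


def summarize_feature(feature):
    """Collect the annotation-related fields from one feature entry.

    Hypothetical helper: it only reads keys that appear in the example entry.
    """
    # Each location item is [contig, start, strand, length]; take the first one.
    contig, start, strand, length = feature["location"][0]
    # The SSO term IDs are the keys of the nested "SSO" dictionary.
    sso_terms = sorted(feature.get("ontology_terms", {}).get("SSO", {}))
    return {
        "feature_id": feature["id"],
        "cds_ids": feature.get("cdss", []),
        "functions": feature.get("functions", []),
        "contig": contig,
        "start": start,
        "strand": strand,
        "length": length,
        "sso_terms": sso_terms,
    }


summary = summarize_feature(json.loads(ENTRY_JSON))
for key, value in summary.items():
    print(f"{key}: {value}")
```

Using `.get()` with defaults means entries that lack `cdss`, `functions`, or `ontology_terms` would yield empty lists rather than raising an error.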

hgscott commented 7 months ago

I looked at using the model object directly: Image

But the gene ID isn't tied to any actual sequence data: Image

hgscott commented 7 months ago

Even in the full model file, the genes aren't tied to any sequence information: Image

hgscott commented 7 months ago

I think I could cobble together a file where source is "RASTtk", the accession is the cdss ID at the start of the entry, the function is "functions" and the e_value is arbitrary. And Michelle can get me the gene_callers_id by using anvi-get-sequences-for-gene-calls.

But my problem with this is that it doesn't give me the ModelSEED reaction ID (e.g. "rxn00015"), which is what I really want.
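
The "cobble together" plan described above could be sketched roughly as follows in Python. This is only a sketch under assumptions: the function names and the `gene_callers_ids` lookup are hypothetical, the gene caller ID `17` is made up for illustration (the real IDs would have to come from Michelle's anvio gene calls), and the column order follows the functions-txt page linked earlier. Like the plan itself, it carries no ModelSEED reaction IDs.

```python
import csv
import io

# Column names for a functions-txt file, per the anvio page linked in this thread.
HEADER = ["gene_callers_id", "source", "accession", "function", "e_value"]


def functions_txt_rows(features, gene_callers_ids, source="RASTtk", e_value=0):
    """Yield one row per (CDS, function) pair.

    `gene_callers_ids` is a hypothetical dict mapping a cdss ID to the anvio
    gene_callers_id; CDSs without a match are skipped. `e_value` is an
    arbitrary placeholder, as proposed in the comment above.
    """
    for feature in features:
        for cds_id in feature.get("cdss", []):
            gene_callers_id = gene_callers_ids.get(cds_id)
            if gene_callers_id is None:
                continue
            for function in feature.get("functions", []):
                yield [gene_callers_id, source, cds_id, function, e_value]


def write_functions_txt(rows, handle):
    """Write the header and rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Abbreviated version of the example entry from this thread.
features = [
    {
        "cdss": ["2738541267___1448"],
        "functions": ["Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)"],
        "id": "2738541267___1448_gene",
    }
]
# Made-up mapping for illustration only.
gene_callers_ids = {"2738541267___1448": 17}

buffer = io.StringIO()
write_functions_txt(functions_txt_rows(features, gene_callers_ids), buffer)
output = buffer.getvalue()
print(output, end="")
```

To produce a real file, the `io.StringIO` buffer could be swapped for `open("rasttk_functions.txt", "w", newline="")`, where the file name is again just a placeholder.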

hgscott commented 7 months ago

I started doing a deep dive on the KBase code for building the models to figure out how KBase is going from the RAST annotation to the ModelSEED IDs, and I think that is best tackled in a separate issue.

hgscott commented 7 months ago

Osnat says that there is an option to retain the original annotations (during the re-annotation step).