DyogenIBENS / Agora

Algorithm For Gene Order Reconstruction in Ancestors

Error Running agora-basic.py: "assert oldName not in seen" #23

Closed by erin-thei 1 year ago

erin-thei commented 1 year ago

Hello,

I am trying to run Agora using my own data (the example worked with no issues). This is the command I tried to run: ~/Agora/src/agora-basic.py species-tree.nwk orthologyGroups/orthologyGroups.%s.list genes/genes.%s.list

(agora) [theillere@Escalante3 Single_Copy_Orthologue_Sequences]$ ~/Agora/src/agora-basic.py species-tree.nwk orthologyGroups/orthologyGroups.%s.list genes/genes.%s.list

| Key | Values |
| --- | --- |
| speciesTree | species-tree.nwk |
| geneTrees\|orthologyGroups | orthologyGroups/orthologyGroups.%s.list |
| genes | genes/genes.%s.list |
| target | |
| extantSpeciesFilter | |
| compress | bz2 |
| workingDir | . |
| nbThreads | 24 |
| forceRerun | False |
| sequential | True |

New task 0 ('ancgenes', 'all') [] Command(args=['/home/theillere/Agora/src/ALL.reformatGeneFamilies.py', 'species-tree.nwk', 'orthologyGroups/orthologyGroups.%s.list', '-IN.genesFiles=genes/genes.%s.list', '-OUT.ancGenesFiles=ancGenes/all/ancGenes.%s.list.bz2', '-OUT.genesFiles=genes/genes.%s.list.bz2'], out='GeneTreeForest.withAncGenes.nhx.bz2', log='ancGenes/ancGenes.log')

New task 1 ('pairwise', 'ancgenes-all') [('ancgenes', 'all')] Command(args=['/home/theillere/Agora/src/buildSynteny.pairwise-conservedPairs.py', 'species-tree.nwk', 'NAME_0', '-ancGenesFiles=ancGenes/all/ancGenes.%s.list.bz2', '-genesFiles=genes/genes.%s.list.bz2', '-OUT.pairwise=pairwise/pairs-all/%s.list.bz2'], out=None, log='pairwise/pairs-all/log')

New task 2 ('integr', 'denovo-all') [('pairwise', 'ancgenes-all')] Command(args=['/home/theillere/Agora/src/buildSynteny.integr-denovo.py', 'species-tree.nwk', 'NAME_0', '+searchLoops', '-OUT.ancBlocks=ancBlocks/denovo-all/blocks.%s.list.bz2', 'pairwise/pairs-all/%s.list.bz2', '-ancGenesFiles=ancGenes/all/ancGenes.%s.list.bz2', '-LOG.ancGraph=ancBlocks/denovo-all/graph.%s.txt.bz2'], out=None, log='ancBlocks/denovo-all/log')

New task 3 ('integr', 'denovo-all.scaffolds') [('integr', 'denovo-all')] Command(args=['/home/theillere/Agora/src/buildSynteny.integr-scaffolds.py', 'species-tree.nwk', 'NAME_0', '-OUT.ancBlocks=ancBlocks/denovo-all.scaffolds/blocks.%s.list.bz2', '-ancGenesFiles=ancGenes/all/ancGenes.%s.list.bz2', '-IN.ancBlocks=ancBlocks/denovo-all/blocks.%s.list.bz2', '-genesFiles=genes/genes.%s.list.bz2', '-LOG.ancGraph=ancBlocks/denovo-all.scaffolds/graph.%s.txt.bz2'], out=None, log='ancBlocks/denovo-all.scaffolds/log')

New task 4 ('conversion', 'basic-workflow') [('integr', 'denovo-all.scaffolds')] Command(args=['/home/theillere/Agora/src/convert.ancGenomes.blocks-to-genes.py', 'species-tree.nwk', 'NAME_0', '+orderBySize', '-IN.ancBlocks=ancBlocks/denovo-all.scaffolds/blocks.%s.list.bz2', '-ancGenesFiles=ancGenes/all/ancGenes.%s.list.bz2', '-OUT.ancGenomes=ancGenomes/basic-workflow/ancGenome.%s.list.bz2'], out=None, log='ancGenomes/basic-workflow/log')

Status: 5 to do, 0 running, 0 done, 0 failed -- 5 total
Available tasks: [0]
Control file ancGenes/ancGenes.log.agora missing
Launching task 0 ['/home/theillere/Agora/src/ALL.reformatGeneFamilies.py', 'species-tree.nwk', 'orthologyGroups/orthologyGroups.%s.list', '-IN.genesFiles=genes/genes.%s.list', '-OUT.ancGenesFiles=ancGenes/all/ancGenes.%s.list.bz2', '-OUT.genesFiles=genes/genes.%s.list.bz2'] > GeneTreeForest.withAncGenes.nhx.bz2 2> ancGenes/ancGenes.log
Status: 4 to do, 1 running, 0 done, 0 failed -- 5 total
Waiting ...
task 0 report: 0.106603 sec CPU time / 0.107803 sec elapsed = 98.8865% CPU usage, 17.625 MB RAM
task 0 is now finished (status 1)
Inspect ancGenes/ancGenes.log for more information
Status: 4 to do, 0 running, 0 done, 1 failed -- 5 total
Available tasks: []
Workflow stopped because of failures
Workflow report: 0.114315 sec CPU time / 0.115183 sec elapsed = 99.2463% CPU usage, 18.0391 MB RAM
(agora) [theillere@Escalante3 Single_Copy_Orthologue_Sequences]$

Here is the input data that I'm working with: https://www.dropbox.com/scl/fo/en4rlnwvvnspv9sj51d3u/h?dl=0&rlkey=ybt2vi7hi09xfgnp2uuw85oz7

Please let me know if you have any insight as to how I can solve this issue. I'm also attaching the log file. Thanks!

Agora_Log.txt

alouis72 commented 1 year ago

Hi @erin-thei, the orthogroup files are not in the expected format. They should not have the first (header) line, and each line should be only a list of genes, with no commas. I guess you used OrthoFinder to generate these HOGs. You can try the script I wrote on the agora_dev branch in src/import: https://github.com/DyogenIBENS/Agora/blob/dev/src/import/orthofinder_hogs/convert_hogs_sp.py

I haven't had the opportunity to test it through the whole ancestral reconstruction process, so I would greatly appreciate any feedback you can give me on it.
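
As a rough illustration of the clean-up described above (this is not the `convert_hogs_sp.py` script), the sketch below drops the header line and removes the commas. The function name `clean_orthogroup_lines`, the tab-separated input layout, the space-separated output, and the `n_id_columns` parameter for skipping leading non-gene columns are all assumptions made for illustration; check them against your own OrthoFinder output and the AGORA documentation.

```python
def clean_orthogroup_lines(lines, n_id_columns=0):
    """Yield one space-separated gene list per orthogroup.

    Assumptions (hypothetical, adjust to your files): the first line is a
    header, columns are tab-separated, and genes inside a column are
    separated by commas. The first `n_id_columns` columns hold IDs, not genes.
    """
    for i, line in enumerate(lines):
        if i == 0:
            continue  # drop the header line
        columns = line.rstrip("\n").split("\t")[n_id_columns:]
        genes = []
        for column in columns:
            # split on commas, then on whitespace, so "g1, g2" gives two genes
            for chunk in column.split(","):
                genes.extend(chunk.split())
        if genes:  # skip orthogroups with no genes at all
            yield " ".join(genes)


# Made-up example input: one ID column followed by two species columns.
raw = ["HOG\tspA\tspB\n", "N0.HOG0000001\tg1, g2\tg3\n", "N0.HOG0000002\t\tg4\n"]
cleaned = list(clean_orthogroup_lines(raw, n_id_columns=1))
```

Writing each yielded string on its own line gives a file with no header and no commas, which is the shape described in the comment above.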

erin-thei commented 1 year ago

Hi @alouis72,

Thanks for your timely response. I will give that a try!

Since I'm new to this workflow, I have a couple of questions. Given my species tree, I was told to run OrthoFinder on all of the nodes (so I ran 68 iterations of OF). Each of those OF runs produced its own HOGs. Am I supposed to use that script for all of them? I guess I am a bit confused about the ancestral reconstruction process as a whole. Any help would be much appreciated. Thanks!

erin-thei commented 1 year ago

Hi again @alouis72 ,

I was able to get past the error I was facing earlier, but I got an error during the buildSynteny.pairwise-conservedPairs.py step saying: No such file or directory: 'ancGenes/all/ancGenes.NAME_0.list.bz2'. While inspecting the scripts, I printed phylTree.listAncestr:

['A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'A16', 'A17', 'A18', 'A19', 'A2', 'A20', 'A21', 'A22', 'A23', 'A24', 'A25', 'A26', 'A27', 'A28', 'A29', 'A3', 'A30', 'A31', 'A32', 'A33', 'A34', 'A35', 'A36', 'A37', 'A38', 'A39', 'A4', 'A40', 'A41', 'A42', 'A43', 'A44', 'A45', 'A46', 'A47', 'A48', 'A49', 'A5', 'A50', 'A51', 'A52', 'A53', 'A54', 'A55', 'A56', 'A57', 'A58', 'A59', 'A6', 'A60', 'A61', 'A62', 'A63', 'A64', 'A65', 'A66', 'A67', 'A68', 'A7', 'A8', 'A9', 'NAME_0']

Why is that last ancestor listed when it's not present in my species tree?
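
One way to spot this kind of mismatch early is to compare the ancestor names with the files actually on disk. The sketch below is a minimal check, assuming a path template with a single `%s` placeholder like the ones on the command line above; `missing_ancestor_files` is a hypothetical helper and not part of AGORA.

```python
import os


def missing_ancestor_files(ancestors, template):
    """Return the ancestor names for which `template % name` does not exist.

    `template` is a path with one %s placeholder, for example
    'orthologyGroups/orthologyGroups.%s.list'.
    """
    return [name for name in ancestors if not os.path.exists(template % name)]
```

Calling it with the list printed above and the orthology group template would return the names that have no input file, which here would be expected to include `NAME_0`.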

alouis72 commented 1 year ago

Hi Erin, the root of the species tree has no name, so AGORA names it NAME_0, but there are no orthogroups for it. Either name the root and provide orthogroups for it (if you have them), or add the option "-target=A2" to the agora command line to build ancestor A2 and its descendants.

About your first question: I don't understand how you built your orthogroups. There may be a risk of inconsistency between ancestors. I know that OrthoFinder2 builds Hierarchical Orthogroups (Phylogenetic_Hierarchical_Orthogroups in the results), which are consistent across the species tree. Maybe you should try that.
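
To check whether a species tree has an unnamed root before launching the workflow, a small sketch like the one below can help. It assumes a plain Newick string whose labels are unquoted and contain no parentheses; `root_label` is a hypothetical helper, not an AGORA function, and it is not a full Newick parser.

```python
import re


def root_label(newick):
    """Return the label of the root node of a Newick string ('' if unnamed).

    Assumption: labels are unquoted and contain no ')' characters.
    """
    text = newick.strip().rstrip(";").strip()
    close = text.rfind(")")
    if close == -1:
        return ""  # no internal node found, so there is no root label to report
    tail = text[close + 1:]
    tail = re.sub(r"\[.*?\]", "", tail)  # drop bracketed comments such as NHX tags
    tail = tail.split(":", 1)[0]         # drop a branch length, if any
    return tail.strip()
```

An empty result means the root is unnamed, which is the situation described above.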

erin-thei commented 1 year ago

Great, thanks for the information. I was actually able to fix the issue before your response and got it working successfully, which is great.

I haven't done a deep dive into the results yet, or how to interpret them, but does Agora report the average number of genes per synteny block? Or is that something that should be done manually?

Thanks so much for your help!
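
For the question about the average number of genes per synteny block, a manual calculation could look like the sketch below. It rests on an unverified assumption about the output layout: each line of the file describes one ancestral gene, fields are tab-separated, and the first field names the block it belongs to. `mean_genes_per_block` is a hypothetical helper, not something AGORA provides; confirm the file format before trusting the numbers.

```python
from collections import Counter


def mean_genes_per_block(lines):
    """Average number of lines (genes) per distinct first-column value (block).

    Assumption: one gene per line, tab-separated, block name in column 1.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        block = line.split("\t", 1)[0]
        counts[block] += 1
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)
```

Since the workflow writes bz2-compressed files, one option is to open a file with Python's `bz2.open(path, "rt")` and pass the resulting file object to the function.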