eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

Removing input files: do I need to re-run the whole analysis, or is there another way? #109

Closed hadassaloth closed 1 year ago

hadassaloth commented 1 year ago

Hello everyone, I'm Hadassa Loth, from the Federal University of Rio de Janeiro, in Brazil.

I'm using the GET_HOMOLOGUES software in my doctoral project; I also used it in my master's project.

Now I'm analysing 30 lineages of a genus (i.e. many species of the same genus). The manual indicates that I have to use gbk files to generate figures like Fig. 7, 12, 13 and 18.

But I also used faa files (following the suggestions of the exercise in Chapter 14, starting on page 211). I put all gbk and faa files in the same directory, and when I reached the step that generates the pan- and core-genome (page 219), GET_HOMOLOGUES counted every file as one lineage, i.e. the analysis no longer had 30 lineages but 60 (counting the sequences from both the faa and gbk files, I guess).

I also tried to re-analyze after removing the gbk files, and then the faa files, from the directory, but got these error messages: "EXIT : cannot find previous input file GCA_1234.gbk, please re-run everything" and then "EXIT : cannot find previous input file GCA_1234.faa, please re-run everything"

How can I compute the pan- and core-genome for 30 lineages only, working with just faa or gbk files? Will I need to re-run the whole analysis using just faa OR gbk files?

Could someone, please, explain how I can resolve this situation without having to re-run the entire analysis again?

Thank you in advance for your patience and attention.

Best regards

eead-csic-compbio commented 1 year ago

Hi @hadassaloth, I respond inline:

> But I also used faa files (following the suggestions of the exercise in Chapter 14, starting on page 211). I put all gbk and faa files in the same directory, and when I reached the step that generates the pan- and core-genome (page 219), GET_HOMOLOGUES counted every file as one lineage, i.e. the analysis no longer had 30 lineages but 60 (counting the sequences from both the faa and gbk files, I guess).

GET_HOMOLOGUES will try to extract both .fna and .faa sequences from your input GenBank files, so adding separate .faa files is usually not required in this case.

> I also tried to re-analyze after removing the gbk files, and then the faa files, from the directory, but got these error messages: "EXIT : cannot find previous input file GCA_1234.gbk, please re-run everything" and then "EXIT : cannot find previous input file GCA_1234.faa, please re-run everything"

Once you have completed an analysis, the software expects to find the input files used in that first analysis, and perhaps some new ones. But as you found out, you cannot remove any of them, because the stored BLAST results must stay consistent with the input set.

> How can I compute the pan- and core-genome for 30 lineages only, working with just faa or gbk files? Will I need to re-run the whole analysis using just faa OR gbk files?

You could use option -I to re-analyze only a subset of 30 lineages. For that you need to create a text file, named for instance selection.list, listing the input files you want to include (see an example in the manual). The contents should be something like:

file1.gbk
file2.gbk
...
file30.gbk

You should then re-run with something like:

perl get_homologues.pl -d data -I selection.list

Please try it and let us know.
Bruno
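For readers who would rather not type 30 file names by hand, the list can be generated from the input directory. The following is only a sketch: the helper name make_selection_list and its arguments are hypothetical, and it assumes the input directory is called data and that the list should contain bare file names, one per line, as in the example above.

```shell
#!/bin/sh
# Sketch: write the names of all files with a given extension in an input
# directory to a list file, one per line, for use with option -I.
# make_selection_list and its three arguments are hypothetical helpers,
# not part of GET_HOMOLOGUES.
make_selection_list() {
    dir=$1   # input directory, the one passed to -d (assumed: data)
    ext=$2   # extension to keep, e.g. gbk or faa
    out=$3   # list file to write, e.g. selection.list
    : > "$out"                      # create or truncate the list file
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue     # skip the pattern itself if nothing matches
        basename "$f" >> "$out"     # keep the file name only, without the directory
    done
}

# Example (assumed directory and file names):
# make_selection_list data gbk selection.list
# perl get_homologues.pl -d data -I selection.list
```

It is worth checking the resulting file before re-running, for instance with wc -l selection.list, to confirm that it holds exactly the 30 entries expected.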

hadassaloth commented 1 year ago

Dr Bruno, your guidance worked, and I really didn't need to re-run the entire analysis. I just created a list containing only the input files I wanted to work on and used the -I flag.

Thank you very much!