AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

Alignment file missing #76

Closed: morgansobol closed this issue 11 months ago

morgansobol commented 1 year ago

Hi Mike!

I have an issue where my GToTree run seems to finish correctly based on the log file, but no alignment file is produced even though the log says it was written. I do get the other files in the GToTree_output/ directory, such as Genomes_removed_for_too_few_hits.tsv, Genomes_summary_info.tsv, and even GToTree_output.tre, but the tree file is empty.

Here is the code I am running: GToTree -f genome_list.txt -H Bacteria_and_Archaea.hmm -m mapping_file.txt -j 1

No errors were reported in gtotree-runlog.txt, but some were reported in my cluster's standard error file, 2023Apr27_2_gtotree.err. Maybe they make sense to you? I do not think it is a mapping file error (as in #64), since I only have "_" and "-" in the names, but maybe this is an issue with the /tmp directory, as in #46?

This is with GToTree v. 1.7.10, but I also had the same issue with v. 1.6.

Look forward to your insight! Morgan

gtotree-runlog.txt

2023Apr27_2_gtotree.err.txt

AstrobioMike commented 1 year ago

Hey there, @morgansobol!

I’m not sure what’s going on just glancing at things, unfortunately, but it’s for sure annoying it ran for hours only to leave you with empty outputs! Sorry it’s giving you trouble :/

Yeah, I don’t think it’s a mapping-file problem either. But it might help to try with just 5 genomes: take the top 5 from your current genome_list.txt, put them in a new input file to pass to -f, and run a quick one with just those 5 and no mapping file, but everything else the same. That will finish super quick, and for tracking this down, there’s no harm in making things as simple as possible while still reproducing the problem.
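The subsetting step suggested above could be done by hand, or with a small helper like the following. This is only a sketch: the function name `write_subset` and the output file name `genome_list_top5.txt` are made up for illustration, and it assumes the genome list has one entry per line.

```python
from pathlib import Path


def write_subset(genome_list, subset_path, n=5):
    """Copy the first n non-empty lines of genome_list to subset_path.

    Returns the kept lines so the caller can eyeball them.
    """
    lines = [ln.strip() for ln in Path(genome_list).read_text().splitlines()]
    kept = [ln for ln in lines if ln][:n]  # skip blank lines, keep the top n
    Path(subset_path).write_text("\n".join(kept) + "\n")
    return kept


# Example call (file names are placeholders):
# write_subset("genome_list.txt", "genome_list_top5.txt")
```

The reduced run would then reuse the options from the original command, minus -m: GToTree -f genome_list_top5.txt -H Bacteria_and_Archaea.hmm -j 1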

There do for sure seem to be lots of weird things, both in the log file and in the stderr. The log is missing some output, I think: during the “Working on genomes provided as fasta files” step, it should typically print something out for each genome, reporting what was found.

I think maybe it’s related to variables not being set or evaluated properly. In the stderr file there are things like it not being able to find a file like “/tmp/gtotree.tmp.rXUkC/_genes1.tmp”. Where it says “_genes1.tmp” at the end there, it’s supposed to have a gene name in front of it, like “RNA_pol_Rpb6_genes1.tmp”. The “RNA_pol_Rpb6” part would typically be stored in a variable. And if that variable weren’t set properly, that’s exactly what we’d see: the full path, just missing the specific gene name, because the empty variable puts nothing there…

I’m not sure what would be causing that though :/

Maybe it has something to do with the cluster sending different parts of the processing to different nodes or something, with the variables being set in one node but not another. Or maybe something else with how some clusters handle variables that I just don’t understand yet.
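To make the empty-variable symptom concrete, here is a small Python analogy. It is not GToTree's actual code: the function names are hypothetical, and it only assumes that the temp file path is built by gluing a gene name in front of "_genes1.tmp", as described above.

```python
import posixpath


def tmp_gene_path(tmp_dir, gene_name):
    """Build the per-gene temp path the way string interpolation would."""
    return posixpath.join(tmp_dir, f"{gene_name}_genes1.tmp")


def checked_tmp_gene_path(tmp_dir, gene_name):
    """Same as above, but fail loudly when the gene name is empty."""
    if not gene_name:
        raise ValueError("gene name is empty; refusing to build a temp path")
    return tmp_gene_path(tmp_dir, gene_name)


# With a proper gene name the path looks as expected:
print(tmp_gene_path("/tmp/gtotree.tmp.rXUkC", "RNA_pol_Rpb6"))
# With an empty name the gene part silently disappears:
print(tmp_gene_path("/tmp/gtotree.tmp.rXUkC", ""))
```

The second call produces the same kind of path seen in the stderr file, which is why an empty variable is a plausible explanation. The checked variant shows one way such a case could be caught early instead of surfacing later as a missing file.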

I’m traveling to run a workshop for a few days and won’t be able to poke at this for a bit, so I’m sorry I’m gonna have to leave you hanging for a little. But I will get to helping troubleshoot as soon as I can.

In the meantime, I’d be interested in you confirming we hit the same problem if you just run it with like 5 genomes and no mapping file. I very much suspect it’ll still have the same issue.

morgansobol commented 1 year ago

Thanks @AstrobioMike for the advice! I'm going to try it with fewer genomes and report back shortly.

morgansobol commented 1 year ago

Hi @AstrobioMike,

Sorry for the delay, our cluster was down for maintenance.

So, it seems that it is an issue with the mapping file, as the run completed successfully with the 5 genomes without it. However, when I try the same run with the mapping file, it reports: "5 genome ID(s) listed in the mapping file (passed to "-m") not found in any of the input genomes".

Can you please take a look at the files and see if you can tell if something is wrong? I have tried giving the full path, as I have in the input file list, and leaving it as *.fa.fna as well. I also don't find any weird spaces or tabs.

mapping_file.txt

genome_list.txt

Thanks!! Morgan

AstrobioMike commented 1 year ago

I am so sorry this slipped past me, @morgansobol :(

I'm sure it's not helpful to you anymore, but it looks to me like the names aren't the same in the mapping file and the genome list.

Meaning the genome list has names like this (with a .fna at the end):

GCA_000007345.1.fa.fna

and the mapping file has names like this (without the .fna at the end):

GCA_000007345.1.fa

That would cause it to not be able to link them up.
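A quick way to spot this kind of mismatch before starting a long run is to compare the two files directly. The sketch below rests on some assumptions: the mapping file is tab-delimited with the genome ID in its first column, the genome list has one path per line, and the ID to match is the file name without its directory. The function names are made up for illustration.

```python
import os


def read_first_column(path):
    """Return the set of first tab-separated fields from non-empty lines."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.strip():
                ids.add(line.split("\t")[0].strip())
    return ids


def unmatched_mapping_ids(genome_list, mapping_file):
    """List mapping-file IDs with no identically named entry in the genome list."""
    # Compare on file names only, so full paths in the genome list still match.
    genome_ids = {os.path.basename(p) for p in read_first_column(genome_list)}
    return sorted(read_first_column(mapping_file) - genome_ids)


# Example call (file names as in this thread):
# print(unmatched_mapping_ids("genome_list.txt", "mapping_file.txt"))
```

With the names shown above, GCA_000007345.1.fa would be reported as unmatched, since the genome list only contains GCA_000007345.1.fa.fna. An empty result would mean every mapping-file ID has an exact counterpart.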

Again, very sorry for dropping the ball on getting back to you!