Alignment for 'Bacteria' SCG set is only 169 AA long

emily-ebel commented 2 years ago

Hi there,

I ran GToTree on my lab's Linux cluster for ~4600 UHGG genomes in fasta format, using the Bacteria SCG.

GToTree -f paths_to_fastas.txt -H Bacteria -o GToTree -j 4

The majority of these genomes passed default checks (including >50% of SCG hits), and all the expected output files were created. Aligned_SCGs.faa seemed pretty small, though -- only 169 AA for what's meant to be 74 genes. About 400 taxa also have only gaps in that short alignment, even though they passed completeness filtering. Maybe unsurprisingly, the resulting tree has phlya all scrambled. I'm wondering if the rest of the alignment somehow got lost from the output file? I can re-run to keep the temporary directory and look for bugs, but wanted to post this in case I'm not the only one.

Thanks for making such a flexible and easy-to-use tool.

emily-ebel commented 2 years ago

Also, tried to use -X flag in case it was an issue with super5, but no luck there either:

$ GToTree -X -f paths_to_fastas.txt -H Bacteria -o GToTree_v3 -j 4

                                  GToTree v1.6.32
                         (github.com/AstrobioMike/GToTree)

  Invalid argument: -X

AstrobioMike commented 2 years ago

Hey there, Emily :)

That definitely sounds like something is wrong :/ Sorry it's giving you trouble, but thanks for writing in about it so I can try to track this down!

Can you do a quick run with just the first 10 or so and see if that comes out as expected?

And could you also possibly let me know the easiest way I could download these genomes, if you've already worked that out? I haven't worked with the UHGG before, and I want to be able to exactly try to recreate this

PS - looking into the -X problem currently too

Thanks!

AstrobioMike commented 2 years ago

Sure enough, the -X option was missing from the argument parser list. Thank you! That's fixed in the repo now, but i'm going to wait until I find out what's going on with the primary problem here before I release it and update in conda.

emily-ebel commented 2 years ago

Hi Mike,

Thanks so much for the speedy response!

I did the run with 2 different sets of 10 genomes, which produced alignments with ~12,000 and ~13,000 positions, respectively. Definitely seems better! (But is that variability expected?)

To download the UHGG genomes, I followed the instructions in the inStrain tutorial. In case the parsing has changed, I'm also including a list of the Genome_accession numbers, which you could use to pull the FTP_download paths from genomes-all_metadata.

UHGG_list.txt

Thanks as well for fixing the -X issue. If needed, I could cut down the genome list, but I'll wait to see what you come up with :)

emily-ebel commented 2 years ago

I also meant to add that the the tmp folder for the original run has some possible clues. Using GrpE as an example:

GrpE_aligned.tmp has a relatively long alignment, 248 characters GrpE_hits.faa, GrpE_hits_filtered.faa, and GrpE_hits_filtered.tmp also have longer sequences, 184-220 characters per genome GrpE_all_aligned.faa, GrpE.faa.tmp, GrpE_formatted.faa.tmp, and GrpE_trimmed.faa.tmp are all cut down to 139 characters.

AstrobioMike commented 2 years ago

Thanks so much for the info, Emily! I'm testing now to hopefully find I get the same odd outcome, and can then track it down :)

Regarding this:

I did the run with 2 different sets of 10 genomes, which produced alignments with ~12,000 and ~13,000 positions, respectively. Definitely seems better! (But is that variability expected?)

Yep, it's not unusual to have variability in final alignment length with different input genomes. There are several steps in the process wherein the input data affect the outcome, rather than it solely being driven by our selected HMM SCG-set 👍

Thanks again, I'll update soon :)

AstrobioMike commented 2 years ago

Unfortunately i wasn't able to recreate the problem @emily-ebel :/

Things finish for me normally after giving all 4,644 input genomes. It drops a little over 400 due to having too few hits, and the resulting concatenated alignment file at the end is over 10,000 amino acids.

I'm not sure how to try to pursue what's happening on your end further. I know it's a pain, as it likely took quite a while to run with 4 cpus (took me 2.5 hours with 50), but if you can and want to run it again with the -d option like you noted to keep the intermediate files, and then you send me that, the input paths_to_fastas.txt file, and the output directory, I might be able to spot what's happening.

Or you could even start with just sending me the runlog that is maintained with any normal run, as you might still have that from your full run. I might be able to spot something in that.

I'm having a hard time imagining what might cause this type of thing. Maybe there is some dependency of a dependency functioning differently, and I don't have a check for it, and/or am not requiring the exact version, and in some rare cases it installs a problematic version. But I really have no idea yet

Sorry I can't recreate it, I tried on two different ubuntu systems I have access to :/

You can send anything you're willing to send to MikeLee\<at>bmsis\<dot>org :)

emily-ebel commented 2 years ago

Hi Mike,

Thanks so much for giving this a go; I really appreciate it.

Since it worked normally for you, I gave it one more hopeful try, and -- this time, full alignment and sensible tree! (It did take twice as long as the first failed run.) I still don't know what the first issue was, but it might have been some kind of blip or disconnect on my server. Really sorry for the wild goose chase. Again, thanks very much and hope this didn't waste too much of your time!

Best, Emily

AstrobioMike commented 2 years ago

Oh, great! It's super scary when things finish "successfully", but something is wrong. That's the worst kind of bug! I'm so glad you caught it

I'd appreciate if you could still send me both run logs like you mention – just in case there is something that can be added to catch whatever weird thing happened the first time. Even if it's rare, that's definitely no bueno

Either way you helped me catch the -X problem. So thanks already! 🙂

AstrobioMike / GToTree

Alignment for 'Bacteria' SCG set is only 169 AA long #51