AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

Same as closed issue #29 - runs successfully but .tre file is empty when using -m flag #64

Closed: kellikmullane closed this issue 1 year ago

kellikmullane commented 1 year ago

*Re-posting here as a new issue in case you don't get notifications about comments on closed issues

Hey there! I know this is a closed issue, but I'm having the exact same problem. My run completes successfully and gives no errors/flags, but anytime I use the -m flag the .tre file is empty at the end of the run. It seems to be recognizing the file associated with the -m flag and is utilizing it, as the labels are included in the 'Genomes_summary_info.tsv' output file. If I do not use the -m flag, this issue does not occur. I saw the note under the -m argument explainer on GitHub saying the file being passed to -m needs to include all of the input genomes - I double-checked this and they are all included.

I saw someone brought up this issue previously (closed issue #29), but theirs just magically seemed to start working so no real resolution was found. I tried installing the latest version again like you recommended in that closed issue - that did not resolve this issue for me.

I recently downloaded GToTree (running v1.6.37), and have been running it on a smaller subset of my dataset (full dataset takes ~15 hours to run on my machine). Would REALLY like to get this -m argument working for the full dataset. Would save me so much time and energy.

I've attached the runlog here for your reference.

Appreciate any help/advice you have! Thank you sincerely in advance!

gtotree-runlog.txt

AstrobioMike commented 1 year ago

hiya, @kellikmullane!

Sorry for the slow response! My email started putting my GitHub issue notifications into spam like 30 days ago!! :( I'm glad yours was only from 3 days ago at least

Sorry you're having trouble, but thanks for letting me know about this. If there is something odd happening I'd really like to find it, of course :)

Could you possibly send me the entire GToTree_TEST/ directory you're using to my email (MikeLee@bmsis.org would work) so I can do things with your exact stuff?

AstrobioMike commented 1 year ago

Also note that *not all input genomes need to be included in the -m mapping file :)

One thing you can check if you don't want to/can't send me things: we can see in the log that it's not making the tree, and it's saying "Non-unique name 'H._lutea_YIM91125' in the alignment". Is it possible you have the same wanted label for more than one input genome in the mapping file? (Either way, I need to incorporate a way to catch this and tell us rather than missing it entirely like now, ha)
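For a quick look at whether any wanted label is repeated, here is a rough sketch. It assumes the mapping file is tab-delimited with the wanted label in the second column (an assumption about the layout, so change the column number if yours differs); `find_duplicate_labels` and `mapping.tsv` are just illustrative names, not anything GToTree provides.

```shell
# Print every wanted label that appears on more than one line.
# Assumption: tab-delimited file, wanted label in column 2.
find_duplicate_labels() {
    cut -f 2 "$1" | sort | uniq -d
}

# Example usage (mapping.tsv is a placeholder file name):
#   find_duplicate_labels mapping.tsv
```

No output means every label is unique as written. Note this compares the labels exactly as they appear in the file, so it would not catch labels that only become identical after some downstream program shortens them.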

kellikmullane commented 1 year ago

Hey @AstrobioMike!

First off, thank you SO much for taking the time to help me out with this - you have no idea how much I appreciate it!

Based off your second message I was able to get it to successfully produce a tree with the -m flag (woohoo!), although I think there might still be something weird going on (or perhaps it's part of the design? unsure). So yes, I do have 2 input genomes that are "H. lutea YIM 91125". However, they are different genome sequences and have different accession numbers. Thus, the wanted labels I had for them included this accession number (e.g., H. lutea YIM 91125 (ARKK00000000) and H. lutea YIM 91125 (BMXD00000000), respectively). So while the wanted labels are partially the same, as a whole they are unique.

I tried deleting these similarly-named lines from the -m mapping file, and that's what resolved my tree issue!

I did notice, though, that the accession numbers I had tacked on to my labels got chopped off when they were input into the tree (e.g., the label I wanted is "H. pacifica 62 (BJUK00000000)", but in the tree it just says "H. pacifica 62"). This is likely what led to my issues with the similarly-named lines in the -m mapping file. However, I would like to be able to keep the accession numbers as part of the label. Is chopping them off a design of GToTree? Or is it perhaps my use of parentheses in a tab-delimited file? If so, do you know of a workaround for this?

More than happy to send you my GToTree_TEST/ directory if that would be helpful. Just let me know!

Again - thank you so so much for your help!

AstrobioMike commented 1 year ago

thanks, @kellikmullane!

The non-standard characters (parentheses here) are definitely what's causing the problem. That isn't anything I can help with within GToTree though, as many of the underlying programs won't allow them (and I suspect one of them is cutting things off, as I don't have that coded anywhere I can think of). To be generally safe as much as possible, it's good to have nothing but maybe underscores and dashes in sequence headers (as any given special character might cause a problem in any given program).

I didn't have anything in here checking for this currently because inputs are either files (which already have rules that generally keep out problematic special characters), or NCBI accessions which don't have a problem. Silly me didn't realize we could hit this very situation with the mapping file though! I'm going to try to put in a check for special characters so in the future we'd get an actual useful error instead of the runaway nonsense GToTree is currently doing in this situation :)
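Until a check like that exists, something along these lines can flag risky labels by hand. This is only a sketch and not GToTree's actual check: it assumes a tab-delimited mapping file with the wanted label in the second column, and it treats anything other than letters, digits, underscores, and dashes as suspect, which is a deliberately conservative choice. `find_unsafe_labels` and `mapping.tsv` are illustrative names.

```shell
# Print "line-number: label" for each wanted label (assumed column 2)
# containing anything other than letters, digits, underscores, or dashes.
find_unsafe_labels() {
    awk -F '\t' '$2 ~ /[^A-Za-z0-9_-]/ { print NR ": " $2 }' "$1"
}

# Example usage (mapping.tsv is a placeholder file name):
#   find_unsafe_labels mapping.tsv
```

Because the allowed set is strict, spaces and periods get flagged too; widen the bracket expression if you know a given character is safe for your run.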

For now with your case here, if you want to attach the mapping file or email it to me i can help adjust it to keep your accessions and convert problematic characters into something else or remove them.

You can try too if you'd like; tr might be an easy way to get the job done. E.g., doing things like tr -d ")(" < mapping.tsv > new-mapping.tsv (which will delete whatever characters we put within the quotes there).
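If you'd rather keep a separator between the name and the accession instead of just deleting characters, here is a different approach from the tr one above, sketched with awk. It assumes a tab-delimited mapping file with the wanted label in the second column, and it only touches that column so the genome IDs or file names in the first column are left alone. `sanitize_labels` and the file names are illustrative.

```shell
# Rewrite only the wanted label (assumed column 2): runs of characters
# other than letters, digits, underscores, and dashes become a single
# underscore, and leading/trailing underscores are trimmed.
sanitize_labels() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF >= 2 {
             gsub(/[^A-Za-z0-9_-]+/, "_", $2)
             gsub(/^_+|_+$/, "", $2)
         }
         { print }' "$1"
}

# Example usage (file names are placeholders):
#   sanitize_labels mapping.tsv > new-mapping.tsv
```

With this, a label like "H. pacifica 62 (BJUK00000000)" would come out as H_pacifica_62_BJUK00000000, so the accession stays in the label. Periods and spaces are replaced as well, so adjust the bracket expression if you want to keep any of them.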

Let me know if i can help :)

And thanks for helping find a place we can make GToTree more robust! 🎉