faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

phyluce_align_get_only_loci_with_min_taxa #261

Closed solenopsis1840 closed 2 years ago

solenopsis1840 commented 2 years ago

Hi, I've noticed that when I repeatedly run phyluce_align_get_only_loci_with_min_taxa, I get greatly different numbers of loci in the output (often more than a two-fold difference!). This is obviously worrisome; I thought the result should always be the same given identical input.
I am running phyluce 1.7.1 build py36_0 on our Linux cluster. Thanks! Dietrich

brantfaircloth commented 2 years ago

This is not the expected behavior. This code is tested with every change to ensure that the same set of loci is extracted on every run.

Would you be willing to send me the files (or a subset of them) to see if I can reproduce the behavior you are seeing?

solenopsis1840 commented 2 years ago

Of course, let me compress it and put it on dropbox. I'll email you the link. Thanks so much!

brantfaircloth commented 2 years ago

Sounds good. I probably can’t look at this until Monday.

On Dec 10, 2021, at 08:06, solenopsis1840 @.***> wrote:

 Of course, let me compress it and put it on dropbox. I'll email you the link. Thanks so much!


brantfaircloth commented 2 years ago

I don't see an issue here - I get the same output when running the code in single-threaded mode and multi-threaded mode (across 2 different multi-threaded runs).

So, running phyluce 1.7.1 in a conda environment, single-threaded:

# run the code w/ 1 core
python ~/Git/phyluce/bin/align/phyluce_align_get_only_loci_with_min_taxa \
--alignments acropyga-nex-trim-clean \
--taxa 126 \
--percent 0.5 \
--output acropyga-nex-trim-clean-5p-a \
--input-format nexus

Copied 901 alignments of 1967 total containing ≥ 0.5 proportion of taxa (n = 63)

# count alignments in the output directory
ls acropyga-nex-trim-clean-5p-a/ | wc -l
901
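
For reference, the threshold in the message above follows directly from the arguments: 126 taxa × 0.5 = 63. The snippet below is a minimal sketch of that kind of filter, written from the reported numbers and not taken from the phyluce source. The function names and the use of `math.floor` for rounding are assumptions made for illustration.

```python
import math


def min_taxa_threshold(taxa: int, proportion: float) -> int:
    """Smallest number of taxa an alignment must contain to be kept.

    Assumption: the product is rounded down; phyluce may round differently.
    """
    return int(math.floor(taxa * proportion))


def meets_min_taxa(taxon_names, taxa: int, proportion: float) -> bool:
    """True if the alignment holds at least the threshold number of distinct taxa."""
    return len(set(taxon_names)) >= min_taxa_threshold(taxa, proportion)


print(min_taxa_threshold(126, 0.5))  # prints 63
```

Because 126 × 0.5 is a whole number, the rounding choice does not matter for this particular run.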

Then running w/ multiple cores in same environment, placing output in a new directory "-b":

python ~/Git/phyluce/bin/align/phyluce_align_get_only_loci_with_min_taxa \
--alignments acropyga-nex-trim-clean \
--taxa 126 \
--percent 0.5 \
--output acropyga-nex-trim-clean-5p-b \
--input-format nexus \
--cores 4

Copied 901 alignments of 1967 total containing ≥ 0.5 proportion of taxa (n = 63)

# count the number of alignments output
ls acropyga-nex-trim-clean-5p-b/ | wc -l
901

Run the same multicore command again, outputting to a third directory:

python ~/Git/phyluce/bin/align/phyluce_align_get_only_loci_with_min_taxa \
--alignments acropyga-nex-trim-clean \
--taxa 126 \
--percent 0.5 \
--output acropyga-nex-trim-clean-5p-c \
--input-format nexus \
--cores 4

# count the number of alignments output
ls acropyga-nex-trim-clean-5p-c/ | wc -l
901

Now, run a diff on the "-a" versus "-b":

diff acropyga-nex-trim-clean-5p-a acropyga-nex-trim-clean-5p-b
[no result because they're the same]

Now, run a diff on the "-a" versus "-c":

diff acropyga-nex-trim-clean-5p-a acropyga-nex-trim-clean-5p-c
[no result because they're the same]

To make sure diff is working as expected, create a file in "-b" and run diff again, ensuring we detect the different file:

# create a file in b that's not in a
touch acropyga-nex-trim-clean-5p-b/test.txt

# run the diff again
diff acropyga-nex-trim-clean-5p-a acropyga-nex-trim-clean-5p-b
Only in acropyga-nex-trim-clean-5p-b: test.txt

As a final test, compute MD5s on "-a" and "-c" and compare those:

find ./acropyga-nex-trim-clean-5p-a -type f -exec md5sum {} + | sort -k 2 | cut -f1 -d" " > dir-a.txt
find ./acropyga-nex-trim-clean-5p-c -type f -exec md5sum {} + | sort -k 2 | cut -f1 -d" " > dir-c.txt
diff -u dir-a.txt dir-c.txt
[no result because they're the same]
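
The same comparison can be sketched in Python for anyone who prefers to see which files differ and not only whether the hash lists match. This is an illustrative helper, not part of phyluce; the function names `dir_digests` and `compare_dirs` are hypothetical.

```python
import hashlib
from pathlib import Path


def dir_digests(directory):
    """Map each file's path (relative to `directory`) to the MD5 of its contents."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_dirs(dir_a, dir_b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of relative paths."""
    a, b = dir_digests(dir_a), dir_digests(dir_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_a, only_b, differing
```

Calling `compare_dirs("acropyga-nex-trim-clean-5p-a", "acropyga-nex-trim-clean-5p-c")` should return three empty lists when the two output directories are identical.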

That said, several of the alignments you sent me are (or were) empty, which could be causing problems, e.g.:

find $PWD/acropyga-nex-trim-clean -type f -empty
acropyga-nex-trim-clean/uce-12938.nexus
acropyga-nex-trim-clean/uce-12474.nexus
acropyga-nex-trim-clean/uce-12998.nexus

It's possible that something strange is going on with these files... but in a standard phyluce run w/ a single or multiple cores, the code will throw an error and stop executing, e.g. when run w/ --cores 4:

Traceback (most recent call last):
  File "/Users/bcf/miniconda3/envs/phyluce-devel/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/bcf/miniconda3/envs/phyluce-devel/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/bcf/Git/phyluce/bin/align/phyluce_align_get_only_loci_with_min_taxa", line 91, in copy_over_files
    aln = AlignIO.read(file, format)
  File "/Users/bcf/miniconda3/envs/phyluce-devel/lib/python3.6/site-packages/Bio/AlignIO/__init__.py", line 392, in read
    raise ValueError("No records found in handle") from None
ValueError: No records found in handle

It could be that the way you are running the job is not catching this error (or not recording it in stdout or stderr), so you are seeing partial output when you run the code (e.g. it runs until one core hits the error). The reason you might see different results each time is that the list of files is not always processed in the same order, or at the same speed for each element, so the error can occur at different times and stop execution after a different number of files has been processed.

To fix this, you should first remove the empty alignment files before processing the directory of alignments with phyluce, e.g. with:

find $PWD/acropyga-nex-trim-clean -type f -empty -print -delete
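
A roughly equivalent check can be sketched in Python, which may be handy inside a job script that already runs Python. The helper names below are hypothetical and the default is a dry run, so nothing is deleted unless `dry_run=False` is passed.

```python
from pathlib import Path


def find_empty_files(directory):
    """Return the zero-byte regular files under `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )


def remove_empty_files(directory, dry_run=True):
    """List empty files; delete them only when dry_run is False."""
    empty = find_empty_files(directory)
    if not dry_run:
        for path in empty:
            path.unlink()
    return empty
```

Running `remove_empty_files("acropyga-nex-trim-clean")` first, and inspecting the returned list, shows what would be removed before anything is deleted.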

solenopsis1840 commented 2 years ago

Thanks so much for looking into this so quickly, Brant! Yes, I can see how the empty files could create problems downstream, and I'll make sure to identify and delete such files. However, the issue persists even if I delete them.

I've looked into this problem some more and tried specifying a separate error file in the job script, but that didn't help identify what is going on. All I can say right now is that it does not seem to happen when I run phyluce on the interactive queue, i.e., the same number of files is always output. That's an easy enough workaround, but it does make me wonder what else may not be running as it should in the queued jobs...

Thanks again and I'll keep you posted on what (if anything) I find!

brantfaircloth commented 2 years ago

Might be worth trying in the queue but with a single thread and not multiple threads. It's not clear to me how the threads are being assigned to the job in the submission script you sent, and something odd could be going on there. Running with a single thread would help narrow down the source of the odd behavior you are seeing.

solenopsis1840 commented 2 years ago

Hi Brant, it seems to be an issue with insufficient memory being assigned to the job. Since our interactive queue has more memory allocated by default than the normal submission queues, it ran there. Thanks again for your help! Best, Dietrich
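
One way to catch a silently truncated run like this is to compare the "Copied N alignments of M total" summary line with the number of files actually written. The sketch below assumes that the job's stdout is saved to a log and that the summary line is only printed once all files have been processed; both are assumptions, and the function names are hypothetical. The scheduler's own job accounting is the more direct place to look for a job killed for exceeding its memory limit.

```python
import re
from pathlib import Path

# Matches the summary line shown earlier in this thread.
SUMMARY = re.compile(r"Copied (\d+) alignments of (\d+) total")


def copied_count(log_text):
    """Return the 'Copied N' count from the log, or None if the line is absent."""
    match = SUMMARY.search(log_text)
    return int(match.group(1)) if match else None


def run_looks_complete(log_text, output_dir):
    """True only if the summary line exists and matches the files written."""
    expected = copied_count(log_text)
    if expected is None:
        return False
    written = sum(1 for p in Path(output_dir).iterdir() if p.is_file())
    return written == expected
```

A missing summary line, or a count that disagrees with the output directory, suggests the job stopped early and its output should not be used.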

brantfaircloth commented 2 years ago

Weird. But glad you got it sorted and thanks for letting me know! I’ll file this away in my brain as a potential source of trouble if anyone else sees the same behavior. 👍