dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Problem with "Step 3: clustering" #345

Closed · StuntsPT closed this 5 years ago

StuntsPT commented 5 years ago

I've been having a very weird issue with my ipyrad analysis, specifically in Step 3: some of the samples are not finishing the clustering. ipyrad claims Step 3 is finished and moves on to Step 4, but Step 4 then reports that some samples have not been clustered.

Under the Tlep01_clust_0.85 directory I find this:

-rw-r--r-- 1 francisco cobig2  56186830 Jun  7 05:35 Bot01.clust.gz
-rw-r--r-- 1 francisco cobig2        48 Jun  7 06:19 Bot01.clustS.gz
-rw-r--r-- 1 francisco cobig2  67104370 Jun  7 05:34 Bot01.htemp
-rw-r--r-- 1 francisco cobig2  91397477 Jun  7 05:34 Bot01.utemp
-rw-r--r-- 1 francisco cobig2  91397477 Jun  7 05:34 Bot01.utemp.sort
-rw-r--r-- 1 francisco cobig2  22245673 Jun  1 10:24 Gal02.clustS.gz
-rw-r--r-- 1 francisco cobig2  18049403 Jun  1 10:25 Gal03.clustS.gz

Bot01_trimmed_merged is one of the samples that Step 4 complains does not have any clusters, while Gal02 and Gal03 are samples that "worked". The file Bot01.clust.gz is very similar in structure to Gal02.clustS.gz; however, Bot01.clustS.gz is empty (as you can probably tell from its size).
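A quick way to spot which samples ended up with effectively empty cluster files (a rough sketch, assuming the directory layout above) is to count the reads inside each clustS.gz:

for f in Tlep01_clust_0.85/*.clustS.gz; do
    # every read header carries a ";size=N;" tag, so this counts reads per sample
    printf '%s\t%s reads\n' "$f" "$(zcat "$f" | grep -c 'size=')"
done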

What do you think could be the cause of this?

StuntsPT commented 5 years ago

Just for the sake of completeness, here are the first few lines of Gal02.clustS.gz and Bot01.clust.gz

zcat Gal02.clustS.gz | head
b28abbacffa082292509cf3dcfeab9bd;size=13;*
TAAGCCCACTGGGGGAGGGTGTTAATGTAGAAGTGGCTTCTTCTTATGAGGATGTTTTGCAAGAGAGACATTTTTACATGCAGCTGCAGAGTAATATGTAGTTATGCCCTGAGT
//
//
81165659719a875b4bc8511160ad856a;size=13;*
CATACTGGGTCGCTCAATCATTGCTACTGTGTCCTTTATTCTAGCAGATCAAGATGATCAGGTGACAGCAAAAACTTCAAGGTTTTTAGCGGGAACAATTAGAATAAGATTTATTTGATT
//
//
c65b462470c083df934327dcd50f89e3;size=13;*
CACTTCCAGAGAACAGCACTGGCTGGACCCATGGATTTATGTTATGGAGTCCCACAGGGCTCCATCTTATCCCTCATGCTGTTCAATATCTACAGCAGGGGTGGCCAACTCCCA
zcat Bot01.clust.gz | head
>000018efa0c56c15b8211105139de92d;size=4;*
GCAGCCCCAGTTGTACTTTTAGACAAGCCTGATGGCTCTGTCAGATTTTGCATCGATTATAGAAAATTAAACCATGTCACTAAAGCGGACGCCTACCCAATGCCCCGCTTAGATGACCTT
>7302c28a504affebca3f7f9b2f8f54b0;size=2;+
AGTTCCTGGGCAGCCCCAGTTGTACTTTTAGACAAGCCTGATGGCTCTGTCAGATTTTGCATCGATTATAGAAAATTAAACCATGTCACTAAAGCGGACGCCTACCCAATGCCCCGCTTA
>ecd74b24946f0307c4f9b44d8ec96914;size=1;+
GTCCCTGGGCAGCCCCAGTTGTACTTTTAGACAAGCCTGATGGCTGTCAGATTTTTCATTGATTATAGAAAATTAAACCATGTCACTAAAGCGGACGCCTACCCAATGCCCCGCTT
//
//
>0000240af5d10f5fc3ff921e6d940847;size=6;*
CCACAGCCTAGGAATGGGTGGGGTGAGGGCAGGATATCCTAATGATCTTCTACCAATGACTTGGTGAAATAATTGGACAAAAAACCCAGTATGTGAGTTTAAAAATAATTAGCTCAAACC

Now that I am looking more closely, they are actually significantly different...
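As a rough sanity check on those files (a sketch, assuming the ";size=N;" headers and "//" separators shown above), the number of clusters and the deepest cluster can be summarized like this:

zcat Bot01.clust.gz | awk '
    /size=/ { n = $0; sub(/.*size=/, "", n); sub(/;.*/, "", n); depth += n }            # sum read depths within a cluster
    /^\/\// { if (depth > 0) { clusters++; if (depth > max) max = depth; depth = 0 } }  # "//" closes a cluster
    END     { if (depth > 0) { clusters++; if (depth > max) max = depth }
              print clusters " clusters, deepest cluster = " max " reads" }'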

isaacovercast commented 5 years ago

Are there any error messages in the ipyrad_log.txt file? What param settings are you using? Is this ls of Tlep01_clust_0.85 from after step 3 completed? I'm curious because the htemp/utemp files for the failed sample aren't being cleaned up, which could indicate some problem. If you want to dropbox me the files for the Bot1 sample I can try to take a look at it. Also, if you re-run step 3 and include the -d flag it'll write more info to the log file (if you DO NOT include the -f flag, it'll only try to re-run those samples that previously failed step 3).
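For reference, that re-run would look something like this (the params filename here is just a placeholder):

ipyrad -p params-Tlep01.txt -s 3 -d
# -d writes extra debug info to ipyrad_log.txt; without -f, only the samples
# that previously failed step 3 are re-clustered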

StuntsPT commented 5 years ago

Hi @isaacovercast, I will run with -d and without -f and post the results here. Which files for Bot1 would you like?

isaacovercast commented 5 years ago

All the files from the _clust directory, the file from the _fastqs directory, and the params file you used would be great.

StuntsPT commented 5 years ago

Will do. I'd normally shrug this off as some problem with those samples, but the non-cleanup of tempfiles has tipped me off that something else might be at play. I'm not sure I'll be able to upload things in time today, so expect the files only by next Tuesday. Sorry about that!

StuntsPT commented 5 years ago

Hi @isaacovercast, sorry about the long wait. I have been preparing a "minimal" example for easy reproducibility. Here is a dropbox link (https://www.dropbox.com/sh/5x7pkqrahvwmg9q/AAAxl6jDn_fM_Y-98xDJihwUa?dl=0), which contains:

  • 2 samples in the original dereplicated fastq.gz format
  • The entire project directory

It was run using the following command: ipyrad -p params-Bug345.txt -s 1234567 -c 16 -d (debug mode activated). Here is the STDOUT:

╭──francisco@Kakarotto [12:44] [~/Data_analyses/Tlep/bug_assembly] {ipyrad}
╰─$ ipyrad -p params-Bug345.txt -s 1234567 -c 16 -d
Enabling debug mode

ipyrad [v.0.7.30]
Interactive assembly and analysis of RAD-seq data

New Assembly: Bug345
establishing parallel connection:
host compute node: [16 cores] on Kakarotto

Step 1: Loading sorted fastq data to Samples
[####################] 100% loading reads | 0:00:10
2 fastq files loaded to 2 Samples.

Step 2: Filtering reads
[####################] 100% processing reads | 0:01:35

Step 3: Clustering/Mapping reads
[####################] 100% dereplicating | 0:00:24
[####################] 100% clustering | 1:02:59
[####################] 100% building clusters | 0:02:48
[####################] 100% chunking | 0:00:15
[####################] 100% aligning | 0:00:01
[####################] 100% concatenating | 0:00:08
no clusters found for Ses02

Step 4: Joint estimation of error rate and heterozygosity
skipping Ses02; not clustered yet. Run step3() first.
[####################] 100% inferring [H, E] | 0:00:01
Info: Sample Bot01 - No clusters have sufficient depth for statistical
      basecalling. Setting default heterozygosity/error to 0.01/0.001.

Step 5: Consensus base calling
Skipping Sample Ses02; not yet finished step4
Skipping Sample Bot01; No clusters found.

Encountered an error (see details in ./ipyrad_log.txt)
Error summary is below -------------------------------
No samples to cluster, exiting.

There are two things I find odd here. First, I have "grepped" the original fastq and confirmed that some sequences are present more than 4 times in the "Bot01" sample (where the largest cluster seems to have size=4), which makes the "no clusters found" result extremely suspicious. Here is an example:

zgrep -c "GGGGAGACAGAGATTACATTGGCATGCAGTCAGCCGAGAAAATGCTCTTCCTTAATCTTAGAATTGTAGAGTTGGAAGGGACCATGAGGATCATCCCGTCCAACCCCCTGCAA" Bot01.fastq.gz
23

Second, I have no idea why there is no Ses02.clustS.gz file at all. And neither sample got its tempfiles cleaned up.

I hope we can get to the bottom of this. Oh, and BTW, this was all performed in a "clean" conda environment, after first running export PYTHONNOUSERSITE=True to make sure no system packages are used. Also, sorry about the huge size of the upload, but I wanted to make sure I didn't miss anything.
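For what it's worth, pulling the error lines out of the log is a quick way to check for anything obvious (a minimal sketch, assuming the default log name):

grep -inE 'error|traceback|exception' ipyrad_log.txt | tail -n 40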

isaacovercast commented 5 years ago

Hello Francisco, Thanks for sending this along. I ran the data and it works for me. I just re-ran the whole thing from step 1 with the -f flag and it looks fine. In your output it looks like there was a problem with the alignment step, since your alignment step finished far too quickly. This could be an indication of disk allocation issues (like you are running out of disk space and the alignment processes are dying). Are you sure you have enough disk space?
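A quick way to check both free space and inodes on the drive holding the project (a small sketch, using the project path from the run output above):

df -h ~/Data_analyses/Tlep/bug_assembly    # free space on the filesystem holding the project
df -i ~/Data_analyses/Tlep/bug_assembly    # inodes can also run out even when space looks fine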

[image: image.png]

StuntsPT commented 5 years ago

That is weird indeed. The drive the analyses are running on still has plenty of space left (~150 GB). It's not that much, but it really should suffice. But now that you mention space and a step finishing too fast... this is being run on a machine with an NVMe SSD. I will try to reproduce on a machine with an HDD, and I will report back as soon as I am able to run this minimal example on our HPC, which has HDDs.

PS - Your attached image did not display on GitHub.

isaacovercast commented 5 years ago

Hm, maybe that's it. If you're running on a very old spinning disk the pipe could overflow and this would cause all kinds of problems. Try it.

StuntsPT commented 5 years ago

Ok, confirming this: on our 2012 HPC, which has HDDs, everything runs normally! Specs:

Intel Xeon E5-2609
SAS HDDs in RAID5

However, on my workstation, with an NVMe SSD, the error occurs. Specs:

AMD Ryzen 7 2700
Samsung 970EVO (SM981/PM981)

Do you have another machine with an SSD where you can confirm this? I will test on my home box, which also has a similar SSD, and see if I can reproduce the issue there.

If this were a problem with the low throughput of an HDD, it would be OK in my book, but if it is an issue with the speed of SSDs, maybe this issue is worth pursuing?

isaacovercast commented 5 years ago

Glad to hear it's working. SSD r/w should be faster than HDD across the board, so my suspicion is that it's something other than the drive. I've run ipyrad on HDD and SSD boxes dozens and dozens of times, so I suspect this is some weird edge case in some config aspect of your workstation, in which case it's not really worth troubleshooting, IMHO, unless you feel really motivated. HW bugs can be a real headache to track down, though.

StuntsPT commented 5 years ago

Ok, I can reproduce this on my home machine too. Same error, actually. Specs:

AMD Ryzen 5 2400G
Samsung 970EVO (SM981/PM981)

So this is not exclusive to a single system, but admittedly my home box and my workstation are rather similar. I think this can be closed for now, but I'll reopen it should I find out more.

StuntsPT commented 5 years ago

Just for reference, I can also reproduce this on my laptop. Specs:

Intel i7-4700HQ
Samsung 840EVO

So far, what these systems have in common: an SSD, and Arch Linux, up to date as of June 22, 2019.
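For comparing the failing and working boxes, the relevant details can be captured with something like this (a sketch; adjust the path per machine):

uname -srm                      # kernel version
lsblk -d -o NAME,ROTA,MODEL     # ROTA=0 marks SSD/NVMe devices
df -hT ~/Data_analyses          # filesystem type and free space
conda list | grep -i ipyrad     # ipyrad version in the active conda env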