Looking for help: IPyradError(NO_ZIP_BINS.format(refseq_file)) at step 3

LaylaIsntHere commented 1 year ago

Good day!

I'm a student majoring in ecology. I've run into trouble with my bioinformatics analysis, and the cause appears to be the step that maps each sample's reads to the reference.

I'm working on a project on population genetic differences in an Abies species. I extracted plant DNA with a general protocol, prepared the DNA following the MIG-seq protocol (Yoshihisa Suyama & Yu Matsuki (2015)), and sequenced it on Illumina's MiSeq platform. The resulting raw data are demultiplexed paired-end reads from 292 individuals in total.

I am trying to use ipyrad to obtain SNPs and estimate the genetic distances among my samples. The input data are raw demultiplexed fastq files (PE data), most of the parameters are left at their defaults, and the program is run in its entirety from step 1 to step 7. With the same input data, the de novo method produces SNPs, but with the reference method the program always crashes during step 3. The error report is as follows:

IPyradError Traceback (most recent call last)
File :1

File ~/anaconda3/envs/ipyrad_py3/lib/python3.10/site-packages/ipyrad/assemble/clustmap.py:1724, in index_ref_with_bwa(data, alt)
   1722     raise IPyradError(NO_ZIP_BINS.format(refseq_file))
   1723 else:
-> 1724     raise IPyradError(error)

IPyradError: [bwa_index] Pack FASTA... 235.18 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=36388353630, availableWord=2572418784
[BWTIncConstructFromPacked] 10 iterations done. 99999998 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999998 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999998 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999998 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999998 characters processed.
. . . .
[BWTIncConstructFromPacked] 3690 iterations done. 36360181598 characters processed.
[BWTIncConstructFromPacked] 3700 iterations done. 36383742110 characters processed.
[bwt_gen] Finished constructing BWT in 3703 iterations.
[bwa_index] 26641.84 seconds elapse.
[bwa_index] Update BWT...

(ipyrad_py3) layla@layla-virtual-machine:~/Documents/ipyrad/test_rawPE_ref1.0$

After the line '[bwa_index] Update BWT...', the program exits instead of continuing to step 4. The reference is the genome (.fa) of another Abies species, downloaded from https://treegenesdb.org/. I would very much like to obtain SNPs with the 'reference' mapping method, since I think they would be more reliable. How can I fix this error? Please let me know if you need any other information. Thank you very much!

Fang

isaacovercast commented 1 year ago

Hello Fang,

It looks like there is a problem during step 3 with indexing the reference sequence. What version of ipyrad are you using? `ipyrad -v` will show it. Can you run `bwa index` on your reference sequence by hand to see if that shows the error better? Please post any output from that here.

LaylaIsntHere commented 1 year ago

Hello Isaac

Thank you for your help. With your guidance, I think I'm closer to the cause of the problem. Here is more information, along with the output of running `bwa index` on its own.

What version of ipyrad are you using?

The version of ipyrad is 'ipyrad 0.9.93'

Can you run bwa index on your reference sequence by hand to see if that shows the error better?

My bwa version is 0.7.17-r1188, which was installed automatically when I installed ipyrad. I ran `bwa index Abal.1_0.fa` in the directory containing the reference (named 'Abal.1_0.fa'), but in the end it failed. Here is the output:

(ipyrad_py3) layla@layla-virtual-machine:~/Documents$ bwa index Abal.1_0.fa
[bwa_index] Pack FASTA... 298.70 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=36388353630, availableWord=2572418784
[BWTIncConstructFromPacked] 10 iterations done. 99999998 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999998 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999998 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999998 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999998 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999998 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999998 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999998 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 899999998 characters processed.
[BWTIncConstructFromPacked] 100 iterations done. 999999998 characters processed.
. . . .
[BWTIncConstructFromPacked] 3650 iterations done. 36232487470 characters processed.
[BWTIncConstructFromPacked] 3660 iterations done. 36270263358 characters processed.
[BWTIncConstructFromPacked] 3670 iterations done. 36303834654 characters processed.
[BWTIncConstructFromPacked] 3680 iterations done. 36333668878 characters processed.
[BWTIncConstructFromPacked] 3690 iterations done. 36360181598 characters processed.
[BWTIncConstructFromPacked] 3700 iterations done. 36383742110 characters processed.
[bwt_gen] Finished constructing BWT in 3703 iterations.
[bwa_index] 24386.40 seconds elapse.
[bwa_index] Update BWT... Killed

It seems the index cannot be created. (After the run, several files appear in that directory; they share the same base name and have the extensions .amb, .ann, .bwt, .fal and .pac.) This may be why my earlier ipyrad runs always failed.
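A note on the file list above: when `bwa index` runs to completion it normally leaves five files next to the reference, with the suffixes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa`. No `.sa` file appears in the list, which would be consistent with the run being interrupted before the final indexing stage. The following is a minimal Python sketch for checking this; the helper name `missing_bwa_index_files` is made up for illustration and is not part of bwa or ipyrad.

```python
from pathlib import Path

# Suffixes that `bwa index` appends to the reference path when it finishes.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(reference):
    """Return the expected index suffixes that are absent or empty.

    `reference` is the path to the FASTA file, e.g. "Abal.1_0.fa".
    This is a hypothetical helper, not a bwa or ipyrad function.
    """
    missing = []
    for suffix in BWA_INDEX_SUFFIXES:
        index_file = Path(str(reference) + suffix)  # e.g. Abal.1_0.fa.sa
        if not index_file.is_file() or index_file.stat().st_size == 0:
            missing.append(suffix)
    return missing
```

Calling `missing_bwa_index_files("Abal.1_0.fa")` in the reference directory would return the suffixes that are absent. The check only looks at presence and size, so it cannot tell a truncated file from a complete one.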

I'm running the analysis in a VM on my PC, with Ubuntu 23.04 installed. Here is the configuration of my VM:

(ipyrad_py3) layla@layla-virtual-machine:~/Documents$ conda info

     active environment : ipyrad_py3
    active env location : /home/layla/anaconda3/envs/ipyrad_py3
            shell level : 2
       user config file : /home/layla/.condarc
 populated config files : /home/layla/.condarc
          conda version : 23.7.3
    conda-build version : 3.26.0
         python version : 3.11.4.final.0
       virtual packages : __archspec=1=x86_64
                          __glibc=2.37=0
                          __linux=6.2.0=0
                          __unix=0=0

(ipyrad_py3) layla@layla-virtual-machine:~/Documents$ free -hm
               total        used        free      shared  buff/cache   available
Mem:            27Gi       3.9Gi        23Gi        35Mi       278Mi        23Gi
Swap:          2.0Gi        85Mi       1.9Gi

What is the reason for the failure of my program? And how can I fix it? Please let me know if there is any other relevant information I can provide.

yours, Fang

isaacovercast commented 1 year ago

Hello Fang,

Thanks for running bwa by hand and discovering the underlying issue. As I suspected, the 'Killed' message indicates that you are running out of RAM and the OS is killing the bwa indexing process. There are two ways to fix this: allocate more RAM to your VM, or perform the indexing by hand on a machine with more RAM and then transfer all the generated files (.bwt, .fal, etc.) to the reference sequence directory inside your VM.
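The 'Killed' line is what the shell prints when a process dies from SIGKILL, which is the signal the Linux out-of-memory killer sends. When a command is launched from Python, the same event shows up as a negative return code. The sketch below illustrates that convention. `describe_exit` and `run_and_diagnose` are hypothetical helper names, not ipyrad functions, and a SIGKILL can also come from other sources (a job scheduler, for example), so the diagnosis is only a hint.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a short diagnosis (POSIX only)."""
    if returncode == 0:
        return "finished normally"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        signum = -returncode
        if signum == signal.SIGKILL:
            return "killed by SIGKILL (possibly the kernel out-of-memory killer)"
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by signal {signum}"
    if returncode == 128 + signal.SIGKILL:
        # Shells report a child killed by signal N as exit status 128 + N.
        return "a shell reported a child killed by SIGKILL (exit status 137)"
    return f"exited with error code {returncode}"


def run_and_diagnose(cmd):
    """Run `cmd` (a list of arguments); return (diagnosis, captured stderr)."""
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    return describe_exit(proc.returncode), proc.stderr
```

For instance, `run_and_diagnose(["bwa", "index", "Abal.1_0.fa"])` would report the SIGKILL case if the indexing run were killed again, at the cost of waiting for the whole run.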

Good luck! -isaac

LaylaIsntHere commented 1 year ago

Dear Isaac,

Thank you for your reply. I tried the run again on our lab's workstation. The program still crashed, but this time I got a different error report.

(ipyrad) user@localhost:~/Documents/FANG/20230926_ipyrad_ref_workingmerchine_utm/test-ref$ ipyrad -p params-test-ref.txt -s 1234567 -d

ipyrad [v.0.9.84]
Interactive assembly and analysis of RAD-seq data

Parallel connection | linux: 16 cores

Step 1: Loading sorted fastq data to Samples
[####################] 100% 0:00:12 | loading reads
584 fastq files loaded to 292 Samples.

Step 2: Filtering and trimming reads
[####################] 100% 0:02:47 | processing reads

Step 3: Clustering/Mapping reads within samples
[####################] 100% 5:47:01 | indexing reference
[####################] 100% 0:00:32 | join unmerged pairs
[####################] 100% 0:00:26 | dereplicating
[####################] 100% 0:00:24 | splitting dereps
[####################] 100% 2:33:59 | mapping reads

Encountered an Error.
Message: IPyradError: bwa error: None
Parallel connection closed.

IPyradError Traceback (most recent call last)
File :1, in

File ~/anaconda3/envs/ipyrad/lib/python3.9/site-packages/ipyrad/assemble/clustmap.py:1927, in mapping_reads(data, sample, nthreads, altref)
   1925 error1 = proc1.communicate()[0]
   1926 if proc1.returncode:
-> 1927     raise IPyradError("bwa error: {}".format(error1))
   1929 # sends unmapped reads to a files and will PIPE mapped reads to cmd3
   1930 cmd2 = [
   1931     ip.bins.samtools, "view",
   1932     "-b",
   (...)
   1935     samout,
   1936 ]

IPyradError: bwa error: None

The input is the raw demultiplexed PE data, and assembly_method is 'reference'. The input data total around 7GB and the reference is around 17GB. I have no prior experience with bioinformatics, and I don't know how much RAM a dataset like this needs. In your experience, is this error caused by a bug in the program or by insufficient hardware?

Yours sincerely, Fang

isaacovercast commented 1 year ago

Hello Fang,

The bwa error is almost certainly caused by running out of RAM. You will need to add more RAM. Alternatively, it might work to reduce the number of cores you're running on, which leaves more RAM per core, using the `-c 8` parameter.
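As a rough illustration of the trade-off described above: if every concurrently running mapping job needs a similar amount of memory, the number of jobs that fit is the available RAM divided by the per-job footprint, capped by the core count. The sketch below is plain arithmetic under that assumption. How ipyrad actually schedules bwa jobs across the cores given by `-c` is not described in this thread, and the per-job footprint has to be measured on the machine (for example by watching one job in `top`). The function names are hypothetical, and the numbers in the final lines are placeholders, not measurements of this dataset.

```python
def mem_available_gib(meminfo_path="/proc/meminfo"):
    """Read MemAvailable from a Linux meminfo file and convert it to GiB."""
    with open(meminfo_path) as fh:
        for line in fh:
            if line.startswith("MemAvailable:"):
                kib = int(line.split()[1])  # the kernel reports this value in kB
                return kib / 1024**2
    raise RuntimeError("MemAvailable not found in " + meminfo_path)


def max_concurrent_jobs(available_gib, per_job_gib, cores):
    """Largest number of simultaneous jobs that fits in RAM, capped by cores."""
    if per_job_gib <= 0:
        raise ValueError("per_job_gib must be positive")
    by_memory = int(available_gib // per_job_gib)
    return max(0, min(cores, by_memory))


# Hypothetical figures for illustration only:
# 64 GiB usable RAM, 25 GiB per mapping job, 16 cores.
print(max_concurrent_jobs(64, 25, 16))  # prints 2
```

A result of 0 would mean that even a single job does not fit, in which case lowering the core count cannot help and more RAM is the only option.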

Since these issues are really hardware related and not problems with the ipyrad codebase I'm going to close these for now. If you have continued questions about resources or ipyrad runs you can jump on the ipyrad gitter channel:

https://app.gitter.im/#/room/#dereneaton_ipyrad:gitter.im

Good luck, -isaac
