dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Aligning to large chromosomes #444

Closed carlahurt closed 3 years ago

carlahurt commented 3 years ago

Hello, I am on step 3 and I'm using a related species as a reference genome. This genome is a beast (32 Gb)! The program seems to recognize that we are dealing with large chromosomes. There are a couple of issues. I see where it is recommending the -c flag but there is also a python error dealing with 'str' that I'm not sure how to fix. Thank you in advance for your help!
image

isaacovercast commented 3 years ago

Hello. Whoops, this is a very small error in the string comparison. I fixed this (333779d) and pushed a new version of ipyrad (0.9.78) which should be up on bioconda within the next 24 hours. Give it a try and let me know how it goes.

carlahurt commented 3 years ago

Thank you for your help with this. I added the fix from your post and made it a bit further in step 3 and hit another python error:

image

isaacovercast commented 3 years ago

Something failed during step 3: the mapping reads step should typically take much longer than 21 seconds. The error message indicates that one or more of the bam files failed to index. My guess is that you ran out of disk space at some point and things started to fail silently. Can you verify that you have sufficient disk space for performing the assembly? Typically this will require hundreds of GB of free space, but the exact amount depends on how much raw data you have.
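One quick way to check the free space mentioned above is a short Python sketch like the one below. It is only an illustration, not part of ipyrad: the function name `free_gb` is hypothetical, and the two locations checked (the current directory and the system temp directory) are assumptions about where files might be written.

```python
import shutil
import tempfile
from pathlib import Path


def free_gb(path="."):
    """Return free space, in GB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(Path(path).resolve())
    return usage.free / 1e9


if __name__ == "__main__":
    # Assumed locations: the directory the assembly runs in, and the temp dir.
    for label, path in [("project_dir", "."), ("tmp", tempfile.gettempdir())]:
        print(f"{label}: {free_gb(path):.1f} GB free at {path}")
```

Run it from the directory where ipyrad is launched, on the same node that performs the assembly, since compute nodes may mount different filesystems than the login node.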

carlahurt commented 3 years ago

Hi Isaac,

I have 17 TB of free space available to my home directory, no quotas are in place. Would any of the files be written in another location if I am running the script from inside my home directory? The HPC administrator has not seen any of the compute node /tmp disks filling up.

I tried to run this again after your update. image

It looked like it finished step 3, but when I tried to proceed to step 4 it failed because none of the files were ready: image

After updating the conda environment we are receiving a warning that the latest version of HDF5 (1.10.7) is incompatible with the version that ipyrad was built on (1.10.6). We can get around this by setting "HDF5_DISABLE_VERSION_CHECK=1". We don't think this is related to the problem.

Thank you for your help!

isaacovercast commented 3 years ago

Hello, well the current version of ipyrad is 0.9.78, and from the screenshot it looks like you're still not on the most recent version, so if you can update and try again that would be great. It might be best to create a new conda environment and install fresh inside this new environment, I bet this will solve the hdf5 error as well. Good luck.

carlahurt commented 3 years ago

Hi Isaac,

We updated the Conda environment and ipyrad to 0.9.78. We are still receiving the following error message in step 3 (after the mapping reads step), which appears to be related to an index for the bam files:

'''
Parallel connection closed.
ValueError                                Traceback (most recent call last)
~/.conda/envs/ipyrad2/lib/python3.9/site-packages/ipyrad/assemble/clustmap.py in build_clusters_from_cigars(data, sample)
   2153     # uncomment and compare against ref sequence when testing
   2154     # ref = get_ref_region(data.paramsdict["reference_sequence"], reg)
-> 2155     reads = bamfile.fetch(reg)
   2156
   2157     # store reads in a dict

pysam/libcalignmentfile.pyx in pysam.libcalignmentfile.AlignmentFile.fetch()

ValueError: fetch called on bamfile without index
'''

After consulting with our HPC expert, we don't think this is due to disk limitations. Do you have any other suggestions for completing step 3?

Thanks again, Carla

isaacovercast commented 3 years ago

Hi Carla,

It's possible that some files in the _refmapping directory from a previous run haven't been cleaned up. You might try removing the full _refmapping directory from the project_dir, and then try running step 3 again with the -f flag, to force overwriting. Let me know how it goes.
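As a rough sketch of the cleanup described above, the following Python snippet lists directories ending in `_refmapping` under a project directory and deletes them only when asked. The function name `remove_refmapping_dirs` and the `dry_run` switch are hypothetical helpers for illustration, not ipyrad features.

```python
import shutil
from pathlib import Path


def remove_refmapping_dirs(project_dir, dry_run=True):
    """Find directories ending in '_refmapping'; delete them unless dry_run."""
    found = sorted(
        p for p in Path(project_dir).glob("*_refmapping") if p.is_dir()
    )
    for path in found:
        if not dry_run:
            # Removes the directory and everything inside it.
            shutil.rmtree(path)
    return found


if __name__ == "__main__":
    # "." is an assumed project_dir; the default dry run only lists matches.
    for path in remove_refmapping_dirs("."):
        print("would remove:", path)
```

After the directory is gone, step 3 would be rerun with something like `ipyrad -p params-mydata.txt -s 3 -f`, where `params-mydata.txt` is a placeholder for the actual params file.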

-isaac

PS - When posting it's really great and helpful when you include screenshots of the complete output of the ipyrad run and also the full error message.

carlahurt commented 3 years ago

Hi Isaac, I deleted the old files and renamed the output folder so that overwriting wasn't an issue. I also included the -f flag.

image

I am still encountering an error related to an index. Please let me know if there is any additional information that might be helpful. image

isaacovercast commented 3 years ago

Something isn't right, the mapping step is going way too fast. Can you post the full results of an ls -ltr in the *_refmapping and _clust directories? I know they will be quite large because you have a lot of samples, but I need to see what's going on.

carlahurt commented 3 years ago

Certainly - Attached are the results of these two folders. Thank you so much for taking the time to look this over. ls_ltr_tmp.txt ls_ltr_refmapping.txt ls_ltr_clust.txt

isaacovercast commented 3 years ago

In the *_refmapping directory you can see these files:

-rw------- 1 churt domain users 2477521779 Jun  6 08:27 B6_E005Y2.sam
-rw------- 1 churt domain users  219666275 Jun  6 08:28 B6_E005Y2-unmapped.bam
-rw------- 1 churt domain users  708752024 Jun  6 08:29 B6_E005Y2-mapped-sorted.bam
-rw------- 1 churt domain users  728515803 Jun  6 08:29 B6_E005Y2-unmapped.fastq

but there should be a *-mapped-sorted.bam.bai file which should be generated by a samtools index command. It's possible that samtools is not installed or not installed correctly on your system (even though it should come down as a dependency of ipyrad). Can you log in to the computer you're running ipyrad on, change directory to the *_refmapping directory and run this command:

samtools index B6_E005Y2-mapped-sorted.bam

Please make sure you are in the same conda environment as you are when you run ipyrad. Let me know what happens, and if there are any error messages please post them here.
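To see how many samples are affected before indexing by hand, a small Python sketch like the following can list every `*-mapped-sorted.bam` file that has no index file next to it. It assumes the index sits beside the bam file as `<name>.bam.bai` (or `<name>.bam.csi`, the name `samtools index -c` writes); the function name `bams_missing_index` is hypothetical.

```python
from pathlib import Path


def bams_missing_index(refmapping_dir):
    """List *-mapped-sorted.bam files with no .bai or .csi index beside them."""
    missing = []
    for bam in sorted(Path(refmapping_dir).glob("*-mapped-sorted.bam")):
        has_bai = Path(str(bam) + ".bai").exists()
        has_csi = Path(str(bam) + ".csi").exists()
        if not (has_bai or has_csi):
            missing.append(bam.name)
    return missing


if __name__ == "__main__":
    # "." is an assumption: run this from inside the *_refmapping directory.
    for name in bams_missing_index("."):
        print("no index for:", name)
```

If every sample shows up in the output, the problem is more likely the indexing step itself than a single corrupt file.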

carlahurt commented 3 years ago

Hello,

Here is the screenshot from the refmapping directory: image

Please let me know if you need more information. Thanks for your help!

isaacovercast commented 3 years ago

I see the problem. By default, samtools index can't handle reference sequences with very large chromosomes. There's a flag for the index command that allows this, so I updated the code to use the samtools index -c option by default. I pushed a new version (v0.9.80), which should be up on bioconda within a day or so. Once it is up there, please install it and try again (you should only have to run step 3 again).
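For context, the default BAI index format can only address chromosomes up to 2^29 bases (512 Mbp), while the CSI format written by `samtools index -c` lifts that limit. The Python sketch below checks a reference against that threshold, assuming a FASTA index (`.fai`, as produced by `samtools faidx`) exists for the reference. The function name and the file path in the usage section are hypothetical.

```python
BAI_MAX_CONTIG_LEN = 2 ** 29  # 536,870,912 bp (512 Mbp)


def contigs_too_long_for_bai(fai_path):
    """Return (name, length) for .fai entries longer than the BAI limit."""
    too_long = []
    with open(fai_path) as handle:
        for line in handle:
            # .fai columns are tab-separated: name, length, offset, ...
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            name, length = fields[0], int(fields[1])
            if length > BAI_MAX_CONTIG_LEN:
                too_long.append((name, length))
    return too_long


if __name__ == "__main__":
    import os

    fai = "reference.fa.fai"  # placeholder path; substitute the real .fai
    if os.path.exists(fai):
        for name, length in contigs_too_long_for_bai(fai):
            print(f"{name}: {length:,} bp exceeds the BAI limit, use a CSI index")
```

Any contig reported here would need the `-c` option when indexing bam files mapped to that reference.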
