dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

denovo+reference/denovo-reference not supported in v0.9 (yet) #376

Open leonardslog opened 4 years ago

leonardslog commented 4 years ago

Hello, it looks like the hybrid assembly approach outlined in the documentation is not supported for the 'ddrad' datatype in v0.9.19 and v0.9.20 (the option has been removed from the params file). Is this functionality deprecated from the earlier versions, or has it always been unsupported for single-end reads? Any alternative solutions would be much appreciated, thanks!

output (command: ipyrad -p params-test.txt -s 1234 -c 8 -f):

ipyrad [v.0.9.20] Interactive assembly and analysis of RAD-seq data

Parallel connection | LAPTOP-xxxxxxxx: 8 cores

Step 1: Loading sorted fastq data to Samples [####################] 100% 0:00:09 | loading reads
2 fastq files loaded to 2 Samples.

Step 2: Filtering and trimming reads [####################] 100% 0:01:29 | processing reads

Step 3: Clustering/Mapping reads within samples [####################] 100% 0:00:12 | indexing reference

Encountered an Error. Message: datatype + assembly_method combo not currently supported.

Parallel connection closed.

isaacovercast commented 4 years ago

Hello, The v0.7 to v0.9 version upgrade included a major overhaul of the internals of step 3. The hybrid denovo+reference method is supported for all datatypes, but we haven't finished polishing this assembly method for the new version, so it's currently hidden. It shouldn't be too long before it's ready for a test drive, but with the holidays and all it's hard to put a timeline on it. I'll leave this issue open because, yeah, it's a known problem and we should fix it. -isaac

leonardslog commented 4 years ago

Awesome, thanks for the clarification!

ajbarley commented 4 years ago

Hey all, I was wondering if there are any imminent plans to release a new ipyrad version where the denovo+reference method can be used? I have some analyses I'm trying to finalize for which it would be useful; basically I'm trying to decide whether to move forward without this option or whether it will be re-implemented soon. Thanks!

isaacovercast commented 4 years ago

@ajbarley There are no imminent plans to implement denovo+reference at this point. This is a 'would be nice' feature, which for boring reasons is actually rather tricky, so it stays low on the pile. I will speculate that this will be fixed within one calendar year plus/minus a year ;)

ajbarley commented 4 years ago

Sounds good, thanks for the update @isaacovercast!

ChuanLego commented 2 years ago

Hi @isaacovercast, just wondering whether denovo+reference is supported now or will be soon? I just tried it with my data and it told me it's not supported; it would be great if it could be supported soon. Cheers

isaacovercast commented 2 years ago

Hi @ChuanLego, I see that I previously prognosticated 1 year +/- 1 year as the soft 'deadline' for when this feature would be added, and we're approaching that date. At this point the denovo+reference method is still on the low-priority pile, unfortunately. I agree it would be great to have, but the amount of work it would take to reimplement is far higher than the benefit that would be obtained from having it available. It's an edge case that I would love to handle, but I don't see it happening any time soon, sorry to say.

jogijsbers commented 1 year ago

Hi @isaacovercast, just wondering if there is any news regarding this feature? Cheers!

isaacovercast commented 1 year ago

Hi @jogijsbers, unfortunately there still hasn't been any motion on this. The denovo-reference method can be achieved with the reference_as_filter parameter. The denovo+reference method in practice doesn't recover data much different from a standard denovo or reference assembly, so it's still on the low-priority list for me. Let me know if you have any questions about performing an assembly with or without a reference, if this might help you proceed. All the best!
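
For reference, a minimal sketch of what that looks like in practice (file names here are hypothetical, and the parameter index can shift between releases, so check it against your own params file):

# keep assembly_method as denovo and point reference_as_filter at the sequence to exclude
grep -n "reference_as_filter" params-test.txt   # confirm the parameter exists in your ipyrad version
# edit that line so it reads, for example:
#   ./chloroplast.fasta    ## [29] [reference_as_filter]: reads mapped to this reference are removed
ipyrad -p params-test.txt -s 34567 -f           # re-run from step 3 onward with the filter applied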

phlomitero commented 1 year ago

Hi @isaacovercast, one issue regarding this topic. I'm running ipyrad 0.9.50 and I've tried to run it with the reference_as_filter parameter to filter out chloroplast sequences. However, the run stops in step 3 because it seems that it does not find the reference file, even though it is located in the main folder (I've also tried adding ./ at the beginning of the path in the params file, with the same result). Cytinus.salida.txt params-Cytinus.txt I wonder whether you can give me some advice on what I'm doing wrong. All the best!

isaacovercast commented 1 year ago

@phlomitero v0.9.50 is pretty old, there's a reasonable chance this problem has been fixed already. Can you please update to the most recent version (0.9.93) and try again?
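
If it was installed with conda (the usual route), updating should look roughly like this; the channels may differ depending on how your environment was set up:

conda update ipyrad -c conda-forge -c bioconda
conda list ipyrad    # check which version is installed now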

phlomitero commented 1 year ago

Hi @isaacovercast, v.0.9.50 is the one we have installed on the cluster. I'm running a subset of samples on a local computer with version 0.9.92 and it seems to go fine (I'll ask for an update on the cluster). However, I've realized that step 3 clustering/mapping is far slower with the reference_as_filter option than the standard denovo assembly without it. Am I right? Thanks a lot for the answer and for keeping up this wonderful software!! All the best!

isaacovercast commented 1 year ago

@phlomitero Wonderful, glad it is working on your local computer and thanks for the positive feedback! Step 3 should be faster with the reference_as_filter option, but there are conditions where it could be somewhat slower. How much slower is 'far slower'? Are you using the same computer and the same number of cores for the w/ vs w/o reference_as_filter runs?

perryleewoodjr commented 10 months ago

@isaacovercast I have been using ipyrad v.0.9.43 with the reference assembly method (parameter #5) and a rather large genome (11 GB), with a single plate of ddRAD (single-end) data (96 samples). The issue I am having is that after 3 days on an HPC with the maximum number of nodes that can be requested (26) it reaches 50% and never makes it any further, even though there is plenty of walltime left. Do you have any advice on how to get this to actually finish? Any information would be greatly appreciated. Thanks.

isaacovercast commented 10 months ago

@perryleewoodjr What sub-step of step 3 is it reaching 50% on? Can you post the job submission script? Is it 26 'nodes' (using MPI) or 26 cores on 1 node? If it is stuck in indexing the reference sequence, it won't matter how many nodes are used because this part of the process doesn't use MPI. If the genome is huge and the amount of ram allocated is not sufficient then the process will be very slow as it will run out of memory and start paging to disk (which will be painfully slow). I suspect this is what's happening. Can you allocate more RAM? You can also ssh to the compute node running the ipyrad process and look at 'top' and 'free' to see if you can figure out more what's happening.
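
Roughly, that check looks like this on a SLURM cluster (the node name is a placeholder; use whatever squeue reports for your job):

squeue -u $USER        # find which compute node the job landed on
ssh <node_name>        # log in to that node
free -h                # see how much RAM and swap is currently in use
top                    # press shift+M to sort by memory; look for bwa/ipyrad near the top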

perryleewoodjr commented 10 months ago

Here is the output:

Step 3: Clustering/Mapping reads within samples [########## ] 50% 3 days, 0:00:18 | indexing reference

Here is the bash script:

#!/bin/bash
#SBATCH --time=72:00:00        # walltime
#SBATCH --nodes=1
#SBATCH --ntasks=26
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10G      # memory per CPU
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

module load mpi

PARAMS=$1
STEP=$2
OUT=$3

ipyrad -p $PARAMS -s $STEP -c 26 -f 1> $OUT 2>&1

cd $SLURM_SUBMIT_DIR

exit 0

I will check top and free.

Thank you!

isaacovercast commented 10 months ago

Yeah, 10GB might not be enough for an 11GB genome. One trick that you could try is getting an interactive session on the cluster with a big chunk of memory and then running bwa index on the reference sequence by hand. If ipyrad finds the index files then it will skip this part of the process.
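
Roughly, in an interactive session with plenty of memory (the FASTA name is a placeholder; use the same file your reference_sequence parameter points to):

module load bwa                        # or use the bwa that ships alongside ipyrad in the conda environment
bwa index my_reference_genome.fasta    # writes .amb/.ann/.bwt/.pac/.sa files next to the FASTA
# with those index files already in place, ipyrad's step 3 should skip the slow indexing sub-step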

perryleewoodjr commented 10 months ago

Yeah, it seems like it has a high virtual memory request. We have some high-memory nodes that I can try. I will check out bwa index. Please let me know if I am interpreting this correctly: I should index the reference genome by hand (separately) using bwa index, then run ipyrad and hopefully it will skip the indexing step.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
252059 XXXX 20 0 165984 4516 1920 R 1.6 0.0 0:06.25 top

I really appreciate your help.

isaacovercast commented 10 months ago

"I should index the reference genome by hand (separately) using bwa Index. Then run ipyrad and hopefully it will skip the indexing step." <- Yes, that is correct.

Good luck, let me know how it goes.

TheGreatJack commented 10 months ago

Some warning or small section about the current inability to use some datatypes with specific assembly methods should be added to the documentation. Is this limited just to the ddrad denovo+reference mode?

isaacovercast commented 10 months ago

@TheGreatJack Thanks for the suggestion, I updated the docs to specify that we don't support these methods any more and also to add details about the reference_as_filter parameter:

https://ipyrad.readthedocs.io/en/master/6-params.html#assembly-method

phlomitero commented 6 months ago

@isaacovercast Sorry for not answering before, but I have a question about the speed of the assembly methods (denovo vs. reference). I have the same dataset running on a 40-core cluster node with "denovo" as the assembly option, and as you can see the times are as follows:

ipyrad [v.0.9.92] Interactive assembly and analysis of RAD-seq data


Parallel connection | nodo92: 40 cores

Step 1: Loading sorted fastq data to Samples [####################] 100% 0:06:05 | loading reads
286 fastq files loaded to 143 Samples.

Step 2: Filtering and trimming reads [####################] 100% 0:26:51 | processing reads

Step 3: Clustering/Mapping reads within samples [####################] 100% 0:38:34 | join merged pairs
[####################] 100% 0:17:56 | join unmerged pairs
[####################] 100% 0:34:57 | dereplicating
[################### ] 99% 17 days, 13:15:16 | clustering/mapping (it is still running...)

The same dataset is also running on my local machine with 20 cores, with the assembly method set to "reference" because I used a small chloroplast genome (about 150 Kb) as the reference. The times are really, really slow in the latter case:

ipyrad [v.0.9.94] Interactive assembly and analysis of RAD-seq data


Parallel connection | rafa-Precision-3660: 20 cores

Step 1: Loading sorted fastq data to Samples [####################] 100% 0:12:43 | loading reads
286 fastq files loaded to 143 Samples.

Step 2: Filtering and trimming reads [####################] 100% 6:07:05 | processing reads

Step 3: Clustering/Mapping reads within samples [####################] 100% 0:00:01 | indexing reference
[####################] 100% 10:54:27 | join unmerged pairs
[####################] 100% 18:34:56 | dereplicating
[####################] 100% 10:30:44 | splitting dereps

I didn't expect such a difference, even if I'm using half the number of cores, because I thought using a reference would speed up the process. Furthermore, when the assembly method is set to "reference" the size of the temporary files increases greatly (in fact, it consumes all the space in my cluster account (1 TB) and then stops).

Is there any suggestion you can provide to speed up the analysis? Surely I'm doing something wrong... Thanks!

phlomitero commented 6 months ago

Sorry, the bold text was not intentional; it was caused by the sequence of dashes...

isaacovercast commented 6 months ago

@phlomitero The runtime on your local computer with 20 cores and the reference assembly method is almost certainly due to under-allocation of RAM. You will need at least 4GB of free RAM per core (so 80GB of free RAM); with paired-end data it could be more than 4GB. If the cores do not have enough RAM and the data is very large, then it will go VERY slowly. Also, step 3 has several substeps that happen before the reference alignment, and these run in both the denovo and reference assembly methods, so the step 3 running on your laptop hasn't even reached the point of using the reference yet (in the example run that you sent).
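
One practical workaround, if you can't free up more RAM, is to cap the core count so each worker still gets roughly 4GB; the params file name and memory figures below are only placeholders:

free -h                                 # say this reports ~64GB available
ipyrad -p params-test.txt -s 3 -c 15    # ~15 cores x 4GB = ~60GB, leaving some headroom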

In reference assemblies the size of temporary files is definitely bigger, it's part of the trade-off for speed. There's nothing you can do about the file sizes except try to get a bigger disk allocation.

If you run this data as single-end and use only R1, it would make things faster and the temp files would not be so large.
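
If you want to try that, one way to set it up (the assembly name and file glob below are placeholders; check them against your own data):

ipyrad -n R1only                                 # create a fresh params-R1only.txt
# in params-R1only.txt, point sorted_fastq_path at the R1 files only, e.g.:
#   ./fastqs/*_R1_*.fastq.gz    ## [4] [sorted_fastq_path]
# set datatype to the single-end equivalent (e.g. ddrad instead of pairddrad), then run all steps:
ipyrad -p params-R1only.txt -s 1234567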

phlomitero commented 6 months ago

Understood! I will beg for more memory allocation! Thanks a lot!