DormanLab / AmpliCI

AmpliCI, a model-based algorithm for denoising Illumina amplicon data.
BSD 3-Clause "New" or "Revised" License

core dumped for 60,000,000 amplicon reads 253 bp (max 1 bp difference) #3

Open jianshu93 opened 3 years ago

jianshu93 commented 3 years ago

Hello Xiyu,

When I ran AmpliCI on all samples pooled together, it worked for a smaller dataset of close to 9,000,000 reads, but I got this error for 60,000,000 reads:

/var/slurm/spool/slurmd/job40856275/slurm_script: line 24: 14421 Segmentation fault (core dumped) ./run_AmpliCI --error --fastq /condo/ieg/jianshu/app/all_samples_merged.filt.fq --outfile /condo/ieg/jianshu/app/all_samples_merged.filt.error
INFO [/condo/ieg/jianshu/app/AmpliCI/src/options.c::parse_options(339)]: Error profile: /condo/ieg/jianshu/app/all_samples_merged.filt.error

/var/slurm/spool/slurmd/job40856275/slurm_script: line 25: 15151 Segmentation fault (core dumped) ./run_AmpliCI --profile /condo/ieg/jianshu/app/all_samples_merged.filt.error --fastq /condo/ieg/jianshu/app/all_samples_merged.filt.fq --outfile /condo/ieg/jianshu/app/all_samples_merged.filt.out

The first step (error profile estimation) dumped core after ~20 minutes. I allocated 1 TB of memory for it, and I don't think the first step is memory intensive.

Thanks,

Jianshu

xiyupeng commented 3 years ago

Hi Jianshu!

Thanks for the input! From your error output, I suspect the first step (error estimation) had already finished, since the message below indicates that the denoising step had already started:

INFO [/condo/ieg/jianshu/app/AmpliCI/src/options.c::parse_options(339)]: Error profile: /condo/ieg/jianshu/app/all_samples_merged.filt.error

AmpliCI allocates memory gradually as the number of clusters grows, so it can run out of memory on big datasets from very complex communities, such as soil microbiomes. For now I recommend running AmpliCI on each sample separately (it generally works well at the scale of millions of reads). We will think about a scalable method for big datasets.

You should find all_samples_merged.filt.error in the folder if the first step finished. If it is not there, I would be very surprised, because I would not expect the error estimation step to run out of memory. There may be a bug behind the segmentation fault, and I will look into it further.

Thanks, Xiyu

xiyupeng commented 3 years ago

Hi Jianshu!

We want to test AmpliCI on a big dataset in order to eliminate possible bugs that only show up at that scale. Do you know of, or can you recommend, any amplicon dataset with 20+ million reads on which we could test AmpliCI? We realized that your dataset contains about 16 billion nucleotides, which may be out of range for some of the data types we use in the code.
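To make the data-type concern concrete, here is a small arithmetic sketch in Python. It only compares the total base count from the issue title (60,000,000 reads of 253 bp) against the ranges of common fixed-width integer types. Which types AmpliCI actually uses for its counters and offsets is not stated in this thread, so the type names and the helper functions `total_bases`, `fits` and `max_reads` are illustrative assumptions, not a diagnosis of the crash.

```python
# Hypothetical illustration: can common fixed-width integer types hold the
# total number of bases? The types AmpliCI really uses are an assumption here.
INT32_MAX = 2**31 - 1    # largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value
INT64_MAX = 2**63 - 1    # largest signed 64-bit value


def total_bases(n_reads, read_length):
    """Total number of nucleotides for fixed-length reads."""
    return n_reads * read_length


def fits(value):
    """Report which integer ranges can represent `value` without overflow."""
    return {
        "int32": value <= INT32_MAX,
        "uint32": value <= UINT32_MAX,
        "int64": value <= INT64_MAX,
    }


def max_reads(read_length, limit):
    """Largest number of fixed-length reads whose total bases stay <= limit."""
    return limit // read_length


if __name__ == "__main__":
    pooled = total_bases(60_000_000, 253)
    print("pooled dataset:", pooled, fits(pooled))
    print("max 253 bp reads in int32 :", max_reads(253, INT32_MAX))
    print("max 253 bp reads in uint32:", max_reads(253, UINT32_MAX))
```

Under these assumptions, the pooled dataset exceeds both 32-bit ranges and only fits in a 64-bit type, so any 32-bit counter or array offset over all bases would wrap around.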

Moreover, we recommend running the error estimation step on only a small subset of the data, even if you want to denoise the pooled data. You do not need such a big dataset to estimate the error profile, and using all of it may take a long time. Our error model is similar to DADA2's, and for big datasets DADA2 only reads samples into memory until at least 1e8 total bases have been reached. We believe a subset of 1 million reads is fine for error estimation.
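One way to draw such a subset is reservoir sampling, which picks a uniform random sample in a single pass without knowing the number of reads in advance. The sketch below is a standalone Python helper, not part of AmpliCI. It assumes an uncompressed FASTQ file with exactly four lines per record, and the function names (`read_fastq`, `reservoir_sample`, `subsample_fastq`) are made up for illustration.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-tuples (header, sequence, plus, quality).

    Assumes exactly four lines per record (no wrapped sequences).
    """
    lines = iter(handle)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(lines, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def reservoir_sample(items, k, seed=0):
    """Return a uniform random sample of up to k items in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample


def subsample_fastq(in_path, out_path, k=1_000_000, seed=0):
    """Write a random subset of k reads from in_path to out_path."""
    with open(in_path) as fin:
        sample = reservoir_sample(read_fastq(fin), k, seed)
    with open(out_path, "w") as fout:
        for record in sample:
            fout.write("\n".join(record) + "\n")
    return len(sample)
```

Calling `subsample_fastq` with the pooled FASTQ as input and a new output path would produce a file that can be passed to the error estimation step, while the full pooled file is still used for denoising. For scale, 1e8 bases corresponds to roughly 395,000 reads of 253 bp, so 1 million reads is comfortably above that threshold.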

Thanks, Xiyu

jianshu93 commented 3 years ago

Hi Xiyu,

I would suggest one here: https://www.nature.com/articles/s41564-019-0426-5

You may need to download this dataset from the NCBI FTP site, since it is quite big (about 50 GB). Let me know if you need any help.

Thanks, Jianshu

jianshu93 commented 3 years ago

There must be something wrong with this data, or there is a bug. There are not that many sequences, but it is still running after 10 hours.

Thanks,

Jianshu

NGmerge_demo_all_sample_merged.fastq.zip

xiyupeng commented 3 years ago

Have you trimmed your reads? In the small subset, I found read lengths ranging from 250 to 400 bp. Based on the quality profile, you may want to truncate reads at 250 bp.

[Image: Rplot03]

You cannot use reads of variable length as input to the error estimation step (you will see a warning message when running it), though the denoising step may be able to accept variable-length reads. Variable-length input causes problems and results in an inaccurate error profile, and AmpliCI relies on an accurate error profile for inference. Also, you may not need a very big dataset to estimate these error rates.
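As a rough sketch of what fixed-length input means in practice, the Python helpers below tabulate read lengths and truncate every read to a chosen length, dropping reads that are shorter. This is a standalone illustration and not AmpliCI's own trimming; the record layout (a 4-tuple of header, sequence, separator and quality string) and the function names are assumptions for the example, and 250 is the truncation length suggested above.

```python
from collections import Counter


def length_histogram(records):
    """Count how many reads have each sequence length."""
    return Counter(len(seq) for _, seq, _, _ in records)


def truncate_record(record, length):
    """Truncate one FASTQ record to `length` bases.

    Returns None for reads shorter than `length`, so the output only
    contains reads of exactly the same length.
    """
    header, seq, plus, qual = record
    if len(seq) < length:
        return None
    return (header, seq[:length], plus, qual[:length])


def truncate_all(records, length=250):
    """Yield truncated records, skipping reads that are too short."""
    for record in records:
        truncated = truncate_record(record, length)
        if truncated is not None:
            yield truncated
```

A histogram with more than one key before truncation signals variable-length input. After `truncate_all`, the histogram should contain a single length.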

I ran AmpliCI on this small subset with no problem, even without truncating. Can you clarify which step was still running after 10 hours, and how many reads are in the dataset?

Thanks, Xiyu

jianshu93 commented 3 years ago

I think it has something to do with the platform. On my HPC system it has been running for 10 hours and is still going, but my MacBook Pro finishes in about half an hour. Strange, though.

Thanks,

Jianshu

jianshu93 commented 3 years ago

It finally finished after 10 hours... I think this HPC node is a really old one.

Thanks,

Jianshu