comprna / RATTLE

Reference-free reconstruction and error correction of transcriptomes from Nanopore long-read sequencing
GNU General Public License v3.0

Memory issue #25

Open shernadi opened 3 years ago

shernadi commented 3 years ago

Hi, I am trying to cluster some ONT cDNA data using RATTLE; however, I am getting the following error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

I suspect it is a memory allocation issue. I am running RATTLE on an AWS instance with 2 TB of RAM and 128 cores, yet RATTLE crashes while using only around 150 GB of RAM. I tried running it with fewer cores (64, 32) but got the same issue. My sample contains 14 million reads, but I also tried downsampling it to 2.5 million and still ended up with the same error. Would you have any suggestion as to what the problem could be?

My example command: ./rattle cluster -I 2.5.million.fastq -t 64 -o out/ --fastq --iso

Many thanks,

Szabolcs

novikk commented 3 years ago

Hello,

yes, it will probably have problems with 14M reads (either time or memory problems) but should be OK with the 2.5M reads. Does this problem happen at the very beginning?

Can you check if there are reads smaller than the k-mer size? (We normally filter out reads shorter than 100 or 150 bp anyway before running RATTLE.)
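
If helpful, a minimal pre-filter along these lines (a sketch assuming seqkit is available; filenames are placeholders):

    # Keep only reads of at least 150 bp before running RATTLE.
    seqkit seq -m 150 reads.fastq > reads.150filter.fastq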

Best, Ivan

jamlover commented 3 years ago

I am experiencing a memory issue too, perhaps because of the amount of data I am dealing with. I have 56M ONT reads, about 108 GB of data. Sequences were filtered to include only reads at least 150 bases long. I have played around with reducing the number of threads and the number of input reads, but during clustering I keep getting either the bad_alloc error or a segmentation fault (the latter first appearing after I reduced the size of my input to about 1/4 of my total reads). As I write, I am running roughly 1/16 of my 56M reads using 24 threads, waiting to see what happens.

My computer has 56 available threads and 512 GB RAM. I just attempted the same 1/16 input (3.5M reads) on 48 threads and got the segmentation fault. When I first tested clustering with a single 4,000-read fastq file, clustering worked; I have 14,157 such 4,000-read files. 1) I'd appreciate any help in getting the program to run properly. 2) If it turns out that parsing my input data into smaller sets that I run individually is the way to get clustering to work, what would be the appropriate workflow to merge the clustering results and continue, or alternatively, what overall workflow would you suggest to produce the final set of transcript isoforms I am after?

FYI, here is my most recent failed command with error message.

    rattle cluster -i all.14Mlines.150filter.fastq -t 48 --fastq --iso -o rattle14M_output
    RNA mode: 0
    Reading fasta file... Done
    Segmentation fault ] 1/3536141 (2.82794e-05%)

Thanks, John Martinson

jamlover commented 3 years ago

Update to my previous comment. I thought that by reducing the amount of input to about 1.77M reads I had finally succeeded, as the cluster command ran for over six hours, but then a seg fault kicked in again...

    rattle cluster -i all.7Mlines.150filter.fastq -t 48 --fastq --iso -o rattle7M_output
    RNA mode: 0
    Reading fasta file... Done
    [================================================================================] 1768069/1768069 (100%)
    [================================================================================] 200218/200218 (100%)
    Iteration 0.35 complete
    Segmentation fault ] 1/121301 (0.000824396%)

I have cut the input in half again hoping it succeeds, and if it does, again would like advice on how to merge my results.

John Martinson

EduEyras commented 3 years ago

Hi John,

thanks for your message and for trying to improve the run.

Another thing that helps is to filter out unusually short and long reads.

Removing short reads avoids potentially unnecessary comparisons and overly large clusters during the correction step.

We've observed that cDNA-seq data sometimes contains chimeric reads that span two genes; these will give rise to large clusters as well.

If you're running with cDNA-seq reads, RATTLE will try both strands in all comparisons. If you orient the reads to their correct strand first (e.g. with our method https://github.com/comprna/reorientexpress), you can run RATTLE on the stranded reads with the --rna option, which performs fewer comparisons.

We have also found that the parameters -B 0.5 -b 0.3 -f 0.2 (the configuration we call c5 in Supp. Table S7 of the bioRxiv preprint) provide a speedup without losing accuracy. These parameters will improve things in the clustering step.
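
For example, a clustering run with that configuration could look like this (modeled on the commands shown earlier in this thread; the filename and thread count are placeholders):

    # c5 configuration: faster clustering without losing accuracy.
    ./rattle cluster -i reads.150filter.fastq -t 24 -o out/ --fastq --iso -B 0.5 -b 0.3 -f 0.2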

Usually the bulk of the memory usage is in the error correction step, which is why avoiding artificially large clusters can help. You can also modify the parameters of the error correction step to reduce the size of the blocks used for correction and thereby reduce memory usage.
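
As a rough sketch of that idea (the -c clusters file argument and the -s block-size option are assumptions on my part, so please verify the exact option names in the help output of your RATTLE version; paths are placeholders):

    # Correct reads using the clusters from the previous step, with a smaller
    # block size for the multiple sequence alignment to lower peak memory
    # (-s is assumed here; check your build's help output).
    ./rattle correct -i reads.150filter.fastq -c out/clusters.out -o out/ -t 24 -s 100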

I hope this helps

E.

jamlover commented 3 years ago

Thanks for the response. I will definitely look into your suggestions and see if some of them can be helpful. At present, however, it looks as though my current clustering attempt will almost certainly complete successfully. This attempt used 1/64 of the total input reads I have available. If I am able to complete clustering in this manner, as 64 subsets of my total available input reads, would it make sense to run each 1/64th through the subsequent steps of correction and polishing separately and then try to somehow merge the 64 final sets of isoforms, or would you suggest a better approach (perhaps using some of the suggestions you made)?

Thanks again, John

Update: Clustering failed on one of my 1/64th fractions; seg fault again. I am now focusing on your suggestion of removing long reads (I had already removed those less than 150 bases from my input).

jamlover commented 3 years ago

Eduardo,

I am providing another update on the progress of my clustering. Dividing my input data into fourths, trimming adapters with porechop, and keeping only sequences in the range of 150 bases to 40 kb appears to have worked. I also used the parameters you suggested above. I have completed two of my four clustering runs, and the other two should complete in the next day or two. Each clustering run takes about 5-6 days total, with approximately 12-13 million input reads in each of my four input sets, using 24 threads.
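
For anyone wanting to reproduce this preprocessing, it amounted to something like the following (a sketch: I used porechop for the trimming, and the seqkit calls shown here are one way to do the equivalent length filter and split, with filenames as placeholders):

    # Trim adapters, then keep reads between 150 bases and 40 kb.
    porechop -i all_reads.fastq -o trimmed.fastq --threads 24
    seqkit seq -m 150 -M 40000 trimmed.fastq > trimmed.150to40k.fastq
    # Split into four roughly equal parts for separate clustering runs.
    seqkit split2 -p 4 trimmed.150to40k.fastq -O parts/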

I said "appears to have worked" for the following reason. The output from the clustering command for one of my data sets said the following: "144861 gene clusters found". The .csv clusters summary file produced however seems to show about 4.5 times as many clusters, and when I extract the clusters there are 645,571 cluster fastq files created. Can you explain the discrepancy?

Thanks, John

EduEyras commented 3 years ago

Hi,

Thanks for the update.

Are these different cluster IDs, or perhaps the outputs include the consensus sequences and the reads used in each cluster?

I'm cc'ing Ivan, who may be able to clarify how best to operate with the output.

best

Eduardo

jamlover commented 3 years ago

Some more fodder for the discussion. Regarding your comment about consensus sequences, the sequences in the cluster sequence files and the clusters in the summary file do not include consensus sequences; both sum perfectly to the number of input reads, 12,415,585. I then thought that the 144,861 number I reported above perhaps reflected the number of clusters with at least X reads in them, so I tried a few values. X=4 got me close but was not quite right: clusters with >= 4 members = 142,459.
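
For reference, I counted cluster sizes from the summary file with a one-liner like this (it assumes a read-per-line CSV with the cluster ID in the second column, which may not match your layout exactly; check the file first):

    # Count clusters containing at least 4 reads (adjust the threshold).
    # If the CSV has a header line, prepend NR > 1 to the first block.
    awk -F',' '{ n[$2]++ } END { c = 0; for (k in n) if (n[k] >= 4) c++; print c }' clusters_summary.csv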

John

EduEyras commented 3 years ago

Thanks John,

by default RATTLE will correct clusters with >5 reads. The corrected reads are those from such clusters; the rest cannot be self-corrected. Perhaps Ivan can clarify where this discrepancy comes from.

cheers

Eduardo

novikk commented 3 years ago

Hi @jamlover!

Glad you managed to run the clustering step on those big datasets. Regarding your question, it looks like you ran RATTLE with the --iso option. This creates isoform clusters instead of gene clusters, but internally RATTLE first generates gene clusters, which are then split into isoform clusters. Thus the number you see is the number of gene clusters that RATTLE internally found; these were later split into 645,571 isoform clusters.

From here, you have several options. If you just want to work with the clusters, you are ready to go, but you should probably filter out those that contain a low number of reads (i.e. <= 5 reads). You can do this with the cluster summary file.

If you want to continue the pipeline, the next step is to correct the reads using the rattle correct option. This will generate a file with the corrected reads (by default, only those reads that belong to a cluster with >5 reads are corrected, but you can set this as a parameter). It will also generate a file with the uncorrected reads, and another file (consensi.fq) that contains one consensus sequence for each of the corrected clusters.

Once you have this consensi file from each part of your dataset, you can merge them together and run a rattle polish step with this file. This will generate a final "transcriptome.fq" file with the final transcripts from your dataset. If the polish step takes too long, you might want to do another round of cluster + correct with the merged consensi files.
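
Roughly, that merge-and-polish step would look like this (a sketch: the part directories are placeholders and the polish flags mirror the other RATTLE subcommands, so verify them against the help output of your version):

    # Merge the per-part consensus files, then polish the combined set.
    cat part1/consensi.fq part2/consensi.fq part3/consensi.fq part4/consensi.fq > merged_consensi.fq
    ./rattle polish -i merged_consensi.fq -t 24 -o polish_out/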

If you need the actual quantification of the whole dataset you might need to do something a bit different. If that's the case, contact me via email when you have all the consensi.fq files and I will help you with that, since it's not implemented yet in the main RATTLE program.

Ivan

jamlover commented 3 years ago

Ivan,

Thanks for the explanation. The gene vs. isoform distinction makes total sense, though that does seem like a lot of "genes"; we'll see what the filtering does in that regard. It is also good to know that you expect I can merge everything at the polish step. I was wondering about that.

I actually started two of my four correction runs last night (I have two good workstations available), but unfortunately our facility had a power outage this weekend, so it looks like I will be starting over, probably on Monday. With a clear path forward, though, I expect I will have final results before too long. I'll let you know how it goes.

Thanks again, John
