dincarnato / RNAFramework

RNA structure probing and post-transcriptional modifications mapping high-throughput data analysis
http://www.rnaframework.com
GNU General Public License v3.0

rf-count multiprocessing #18

Closed: coffeebond closed this issue 2 years ago

coffeebond commented 2 years ago

I wonder if the multiprocessing for rf-count is working properly.

I tried to run rf-count with either -p 20 or -p 4 -wt 5, but I didn't see it using 20 cores. It took about 6 hours to go through ~100 million mapped reads in a pre-sorted BAM file.

Is this normal?

dincarnato commented 2 years ago

Hi coffeebond,

the -wt option in rf-count only affects BAM file sorting. Unfortunately, right now, reading and processing of the BAM file is done on a single thread. Making it multithreaded would require a complete rewrite, something we plan to do, but that won't happen any time soon, I'm sorry. Best,

Danny

coffeebond commented 2 years ago

I see. I thought it used different cores for examining the alignments and finding mismatches.

Thanks for the explanation.

dincarnato commented 2 years ago

Not at present, but it will in the future. All the other modules are very well optimized for multithreading; rf-count is the one that still needs a speed-up.

dincarnato commented 2 years ago

One last note. A possible workaround would be to split your file into 20 chunks (I assume you have 20 processors), process them in parallel with rf-count (-p 20), then merge the resulting RC files with rf-rctools merge.
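For reference, a rough shell sketch of that workaround (the samtools/split steps, all file names, and the exact rf-count/rf-rctools invocations below are illustrative assumptions rather than documented defaults; adapt them to your own options, and see the merge syntax discussed later in this thread):

```bash
# Sketch only: assumes a coordinate-sorted, indexed sample.bam, samtools and
# GNU coreutils on the PATH, and 20 available cores.

# 1. Partition the reference sequences into 20 roughly equal BED files
samtools idxstats sample.bam | awk '$1 != "*" {print $1"\t0\t"$2}' > refs.bed
split -n l/20 -d refs.bed chunk_          # writes chunk_00 .. chunk_19

# 2. Extract one BAM per partition (one pass over the BAM per chunk)
for bed in chunk_??; do
    samtools view -b -L "$bed" -o "$bed.bam" sample.bam
done

# 3. Pass all 20 chunk BAMs to a single rf-count call with -p 20 so they are
#    processed in parallel (add your usual reference/output options here)
rf-count -p 20 chunk_??.bam

# 4. Merge the per-chunk RC files (comma-separated list, no spaces); adjust
#    the path to wherever rf-count wrote its RC output
rf-rctools merge $(ls chunk_??.rc | paste -sd, -)
```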

coffeebond commented 2 years ago

Interesting thought. I guess that's something I can do to make this step much faster.

coffeebond commented 4 months ago

Sorry to re-open this issue.

I assume rf-count still cannot use multiprocessing to speed this step up. I followed your suggestion by splitting the BAM file into 40 chunks and using rf-count to process each chunk.

However, merging the 40 RC files with rf-rctools merge also takes a very long time. Do you know if this is normal? Each of my RC files is about 200 MB, but the merged RC file has only reached 100 KB after 15 min. Assuming the final RC file ends up a similar size, this would take 500 hours to finish...

dincarnato commented 4 months ago

Unfortunately it still cannot. I don't have much bandwidth to implement that at the moment, but it's definitely on my to-do list. I'm not sure why this is so slow on your system. 200 MB RC files are quite large, though... what's in those files? How many transcripts? The bottleneck is, unfortunately, disk read/write speed.

dincarnato commented 4 months ago

I just tried on my system with 40 RC files of 40 MB each. Merging took less than 5 min (it writes something like 150 KB/sec). It does look like the problem might be your drive's read/write speed.

coffeebond commented 4 months ago

Thanks for checking.

I can double-check my code but I highly doubt it's my drive read/write speed.

Do you think it has something to do with the number of entries? I have ~500,000 chromosomes/contigs in my reference. Each RC file has ~1.4 million lines.

coffeebond commented 4 months ago

If I use rf-rctools view on each file and merge the count data for all 40 RC files myself, it takes about 1 hour to complete. Each file takes about 70-80 seconds to process.

dincarnato commented 4 months ago

500,000 entries is really a lot. The problem is that the merge has to perform 500,000 × 40 = 20 million read operations plus 500,000 writes, and that is the bottleneck.

I have a possible workaround in mind, but it would require you to rerun rf-count.

I will work on this tomorrow and update you asap.

dincarnato commented 4 months ago

Hi @coffeebond,

I was able to recreate your scenario. You mentioned 200 MB RC files with 500,000 entries. It looks like, for a file of that size, your sequences must be around 50 bp long on average. Is that correct?

Indeed, the issue was the very high number of seek() operations. I have now made this process significantly more efficient. There is still a lot of read/write involved, but now 40 files are merged in ~30min.

Can you please git pull and let me know if this solves the issue?

Best, Danny

coffeebond commented 4 months ago

Hi Danny,

I just pulled.

Did you change the command syntax? When I used the same command as yesterday, I got the error "[!] Error: Provided RCI index file does not exist." The files definitely do exist. The RC files are still separated by commas with no spaces, right?

dincarnato commented 4 months ago

Hi @coffeebond,

yes, to make this more efficient, the program now expects RC files with the exact same structure, so no index file is required anymore. This should be the case for your files, as they were all generated using the same reference, so they should all be identical in structure. Just remove the -i parameter and try again.
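To make the change concrete, a minimal before/after sketch (the file names, including the .rci extension, are placeholders; the comma-separated, no-space list is as discussed above):

```bash
# before the update: an RCI index file was passed via -i (illustrative paths)
rf-rctools merge -i reference.rci chunk_00.rc,chunk_01.rc,chunk_02.rc

# after pulling the update: all RC files are assumed to share the exact same
# structure, so the index is no longer needed
rf-rctools merge chunk_00.rc,chunk_01.rc,chunk_02.rc
```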

Danny

coffeebond commented 4 months ago

Hi Danny,

I tried that and it worked. It took about 25 min to finish the merge job.

Thanks!

dincarnato commented 4 months ago

Glad we fixed this, and thanks for reporting!