performance following minimal working example

jon-xu commented 2 years ago

Hi Adnan,

Thanks again for the tool! I have been able to detect the polyA tails with a few datasets. Now I am working on a relatively large cDNA dataset, which is about 350GB in size.

I have tried increasing num_cores from 16 to 32 and 64, with 20GB per core, but I didn't notice a big improvement in speed. Based on the progress in the log file, it will need more than 160 hours for our datasets.

Do you have any advice on making it even faster, please? Thanks! Jon

adnaniazi commented 2 years ago

Hi Jon,

Thanks for using tailfindr.

Sorry to hear about the performance issues. Unfortunately, tailfindr gets slow on big datasets at the moment. I would recommend splitting the big dataset into, let's say, four folders and then running tailfindr on each folder individually. Not an optimal solution, I understand. Sorry for that. It is on my to-do list to improve performance for large datasets.

Best, Adnan
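
For anyone landing here with the same problem, the splitting step suggested above could be scripted roughly as follows. This is only a minimal sketch, not part of tailfindr: the function name `split_into_batches`, the `batch_N` folder naming, and the `*.fast5` pattern are assumptions made for illustration. By default it creates symlinks so the original data is left untouched; whether symlinked files are read correctly by a given tool is also an assumption, so `move=True` is there as a fallback.

```python
import shutil
from pathlib import Path


def split_into_batches(src_dir, dest_dir, n_batches=4, pattern="*.fast5", move=False):
    """Spread files matching `pattern` under src_dir over n_batches subfolders.

    Files are assigned round-robin in sorted order, so batches differ in
    file count by at most one. Returns one list of target paths per batch.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")

    src = Path(src_dir)
    dest = Path(dest_dir)
    # Collect the file list up front so newly created batch folders are never rescanned.
    files = sorted(p for p in src.rglob(pattern) if p.is_file())

    batches = [[] for _ in range(n_batches)]
    for i, path in enumerate(files):
        index = i % n_batches
        batch_dir = dest / f"batch_{index + 1}"  # hypothetical naming scheme
        batch_dir.mkdir(parents=True, exist_ok=True)
        target = batch_dir / path.name
        if target.exists() or target.is_symlink():
            # Two source files with the same name would otherwise overwrite each other.
            raise FileExistsError(f"{target} already exists")
        if move:
            shutil.move(str(path), str(target))
        else:
            target.symlink_to(path.resolve())
        batches[index].append(target)
    return batches
```

Each resulting `batch_N` folder can then be given to a separate tailfindr run, and the per-folder result files combined afterwards.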

maximus-sci commented 2 years ago

Have you considered adapting tailfindr to work with SLOW5 files? Nanopolish recently added support for this and it improves performance substantially.

https://www.nature.com/articles/s41587-021-01147-4

https://github.com/hasindu2008/slow5tools

adnaniazi commented 2 years ago

It's on my to do list, when I get time for it.

jon-xu commented 2 years ago

Thanks Adnan!

We ended up subsampling the fast5 files.

Will try SLOW5 when I have a chance!

Cheers,

Jon

On 2 Jul 2022, at 01:11, Adnan Niazi wrote:

It's on my to do list, when I get time for it.


jon-xu commented 2 years ago

Thanks Adnan!

It’s still manageable - I’ll just let it run and keep you updated!

Cheers,

Jon

On May 31, 2022, at 16:30, Adnan Niazi wrote:

Hi Jon,

Thanks for using tailfindr.

Sorry to hear about the performance issues. Unfortunately, tailfindr gets slow on big datasets at the moment. I would recommend splitting the big dataset into, let's say, four folders and then running tailfindr on each folder individually. Not an optimal solution, I understand. Sorry for that. It is on my to-do list to improve performance for large datasets.

Best, Adnan


adnaniazi / tailfindr

performance following minimal working example #29