Create a sampling pipeline to speed up CARP

lucventurini commented 6 years ago

Dear Ms. Zeng, I read your CARP paper with great interest, as I am developing a genome annotation pipeline in the group of @swarbred together with @gemygk. We currently use RepeatModeler for our repeat annotation step, but we are not enthusiastic users and are looking for alternatives. CARP is of course on our radar. We are especially interested in running something faster than the classic RepeatModeler.

I tested CARP on Arabidopsis thaliana on our cluster, and I was impressed by its speed - using four cores, I was able to finish the analysis in 15 minutes instead of the 56 hours required by RepeatModeler using the exact same resources.

However, looking at your paper and at the documentation on this site, it looks like CARP scales poorly with the size of the genome - I would imagine because of the quadratic nature of the problem. RepeatModeler solves the same problem (poorly) by fetching multiple samples of the genome of increasing size and iteratively masking the sequence with the libraries found in the previous rounds.

My question is, would it be sensible, feasible, and easy to implement a similar pipeline for Krishna + Igor? Given the speed of Krishna, and my results above, the pipeline would not need to chunk and sample as aggressively as RepeatModeler does - even an initial sample of 200-400 Mbp would probably finish very quickly and provide a good starting library.
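
To make the idea concrete, here is a rough sketch of the sampling step only: drawing random, non-overlapping fixed-size windows from a genome until a target number of bases is reached. This is not part of CARP or RepeatModeler; `sample_chunks` and its parameters are hypothetical names chosen for illustration, and the genome is assumed to be already loaded as a dict of sequence name to string.

```python
import random


def sample_chunks(genome, target_bp, chunk_size, seed=0):
    """Pick random non-overlapping windows totalling roughly target_bp bases.

    genome: dict mapping sequence name -> sequence string (assumed in memory).
    Returns a list of (name, start, subsequence) tuples sorted by position.
    """
    rng = random.Random(seed)
    # Enumerate every full-length window on every sequence; windows start at
    # multiples of chunk_size, so they can never overlap each other.
    windows = [
        (name, start)
        for name, seq in genome.items()
        for start in range(0, len(seq) - chunk_size + 1, chunk_size)
    ]
    rng.shuffle(windows)
    # Take enough windows to reach the target, capped by what is available.
    wanted = max(1, target_bp // chunk_size)
    picked = sorted(windows[:min(len(windows), wanted)])
    return [
        (name, start, genome[name][start:start + chunk_size])
        for name, start in picked
    ]


if __name__ == "__main__":
    toy_genome = {"chr1": "ACGT" * 250, "chr2": "G" * 500}
    for name, start, seq in sample_chunks(toy_genome, target_bp=600, chunk_size=100):
        print(f">{name}:{start}-{start + len(seq)}")
```

The sampled windows would then be written out as FASTA and passed to Krishna + Igor; later rounds could draw a larger sample from a genome masked with the library from the previous round, as RepeatModeler does.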

Thank you for your kind attention.

Kind regards

Luca Venturini

lucventurini commented 6 years ago

Dear Lu, First of all, my sympathies for the move – I recently moved myself (although not as far as to have to change continent!) and I totally understand the stress.

It is good to hear about the progress on Igor. Please let me know if there is any progress on it – we would be keen to have a more efficient way of predicting repeats in our genomes!

Many thanks

Luca

On Thursday, 16 August 2018 at 16:46, LuZeng wrote (Subject: Re: [carp-te/carp-documentation] Create a sampling pipeline to speed up CARP (#19)):

Dear Luca, Sorry for the late reply. I was busy moving overseas, and it took a while for me to finally settle down.

Thank you for your email. I'm happy to hear that you're interested in CARP. Yeah, CARP scales mainly with the complexity of the repeats in the genome, not with the genome size. We're currently working on CARP to speed up the Igor step, and testing it on PacBio data. I will definitely keep you updated if there is any progress on it.

Many thanks, :) Lu


luzengAdelaide commented 6 years ago

Dear Luca,

By the way, you can also take a look at https://github.com/davidaray/carp_for_raylab, which incorporates our pipeline into bash scripts.

Cheers, Lu
