baoe / AlignGraph

Algorithm for secondary de novo genome assembly guided by closely related references

Parallelizing read grouping post alignment #23

Open ramprasadn opened 8 years ago

ramprasadn commented 8 years ago

Hi,

I'm running AlignGraph for one of my projects and it has been running for quite some time now. Upon closer inspection, I realized that the time-consuming step is where AlignGraph groups reads that map to reference contigs into separate files (tmp/_reads_genome* files). This step is taking roughly four minutes per contig in my case. I have approximately 3000 contigs, which means AlignGraph will be at this stage for at least 200 hours. So I have a suggestion: could this step be parallelized? If AlignGraph could handle multiple instances of this grouping independently, I could use more threads and get past this step faster. I have at least ten reference-based assemblies to make and I would like this step not to be the rate-limiting one.

Thank you very much, Ram
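To make the request concrete, here is a minimal sketch of what per-contig parallel grouping could look like, together with the runtime estimate from the comment above. It is not AlignGraph's code: the function names, the `reads_<contig>.txt` file naming, and the `(read_id, contig_id)` input format are all assumptions made for illustration, and the estimate assumes ideal scaling across workers.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def estimated_hours(n_contigs, minutes_per_contig, workers=1):
    """Wall-clock estimate in hours, assuming perfect scaling across workers."""
    return n_contigs * minutes_per_contig / 60 / workers


def group_reads_for_contig(contig_id, alignments, out_dir):
    """Write the IDs of reads aligned to one contig into that contig's file."""
    # Hypothetical file name; AlignGraph's real tmp file layout may differ.
    out_path = Path(out_dir) / f"reads_{contig_id}.txt"
    read_ids = [read_id for read_id, target in alignments if target == contig_id]
    out_path.write_text("".join(f"{read_id}\n" for read_id in read_ids))
    return contig_id, len(read_ids)


def group_reads_parallel(contig_ids, alignments, out_dir, workers=4):
    """Run the per-contig grouping for several contigs at the same time."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Threads keep the sketch simple; CPU-bound work in Python would
    # normally need processes instead to get a real speedup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(group_reads_for_contig, contig_id, alignments, out_dir)
            for contig_id in contig_ids
        ]
        return dict(future.result() for future in futures)
```

Each contig writes to its own file, so the tasks share no output and can run independently. With the numbers above, `estimated_hours(3000, 4)` gives 200 hours on one worker and `estimated_hours(3000, 4, workers=4)` gives 50 hours in the ideal case; real speedups would be lower because of I/O contention.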

baoe commented 8 years ago

Hi, Ram,

Maybe you could try PBLAT or Nucmer for AlignGraph? The former is the parallelized version of BLAT and the latter is much faster.

Best, Bao


ramprasadn commented 8 years ago

Thanks for your response, Bao.

I tried that, but for some reason AlignGraph seems to be going for blat instead. When I run top to check on the processes, I can see that pblat is invoked before aligning a contig to the reference genome, but it then quickly changes to blat. I think something's off here, as the _contigsgenome..psl.tmp._ files are empty. I'm using the latest version of pblat from https://github.com/icebert/pblat. Considering that their source is from a year ago, I think I'm using the right version, but there is no error message on the terminal, so I have no way to tell what's happening. What do you suggest? I've checked and I know that I have pblat in the path.

In my run, the initial blat and bowtie runs finished in about a day and a half; it's the read grouping post alignment that has been going on for about five days, and at this rate it will take three more days to finish. It would be great if I could get pblat to work, as that would allow the initial stages to finish in a couple of hours. Perhaps in a later version read grouping could be parallelized as well, as something that a user could specify. Even if I could use only four threads, it would be roughly three times faster. Just a suggestion :)

Cheers, Ram
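Since nothing is printed to the terminal, one way to narrow this down is to run the aligner by hand and capture its exit status and stderr, and to list which output files are empty. The sketch below uses only the Python standard library; the helper names `find_empty_files` and `run_and_report` are made up for illustration, and the command and glob pattern to pass in are left to the user, because the exact arguments AlignGraph gives to pblat are not shown in this thread.

```python
import glob
import os
import shutil
import subprocess


def find_empty_files(pattern):
    """Return the paths matching a glob pattern whose size is zero bytes."""
    return sorted(p for p in glob.glob(pattern) if os.path.getsize(p) == 0)


def run_and_report(cmd):
    """Run a command and keep its exit status and stderr for inspection."""
    resolved = shutil.which(cmd[0])  # same PATH lookup the shell would do
    if resolved is None:
        return {"found": False, "path": None, "returncode": None, "stderr": ""}
    proc = subprocess.run([resolved, *cmd[1:]], capture_output=True, text=True)
    # On POSIX, a negative return code means the process was killed by a
    # signal (for example -11 for a segmentation fault).
    return {
        "found": True,
        "path": resolved,
        "returncode": proc.returncode,
        "stderr": proc.stderr,
    }
```

Calling `run_and_report` with `"pblat"` as the first element and the same arguments as a failing alignment would show whether the binary is found on the path, whether it exits with a non-zero or negative status, and anything it wrote to stderr.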

baoe commented 8 years ago

Hi, Ram,

If PBLAT switches to BLAT automatically, it means PBLAT ran into a problem and could not proceed (e.g. a crash). My guess is that PBLAT crashed after processing the first contig. So maybe all we can do is wait for a more stable PBLAT.

Best, Bao


ramprasadn commented 8 years ago

That's probably it. Hopefully, their new version will fix this issue.

Thanks, Ram