bowmanjeffs / paprica

paprica - PAthway PRediction by phylogenetIC plAcement

Speed up PAPRICA by only using unique sequences? #47

Closed: chassenr closed this issue 7 years ago

chassenr commented 7 years ago

Hi, I have a very large 16S dataset and was wondering whether there is an option to run PAPRICA on only the unique 16S sequences, to speed up the initial alignment step. If I understand the logs correctly, the program continues with unique sequences after cmalign anyway. Of course, I can already give PAPRICA a dereplicated set of 16S sequences, but as far as I can see there is no option to take the sequence counts (abundances) of a non-redundant dataset into account when calculating the metabolic profile.

Thanks!

Cheers, Christiane

bowmanjeffs commented 7 years ago

Christiane, sorry for the delayed reply; I missed the email notification for this thread. There is currently no option to limit the analysis to unique sequences. However, pplacer does this automatically during phylogenetic placement (the most time-intensive part of the paprica pipeline). If you have too many reads to reasonably run paprica on your available resources, you could try the paprica Amazon machine image. This lets you run paprica on a virtual machine that may have more cores than your physical computer. Let me know if you need more details.

Cheers, Jeff
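
For readers who want to try the workaround raised in the question, here is a minimal sketch of dereplicating a FASTA file while keeping per-sequence counts. It is not part of paprica: the function names (`read_fasta`, `dereplicate`, `write_fasta`) and the `uniqN` labels are made up for illustration, and the `;size=N` header annotation is only a convention borrowed from other dereplication tools. Nothing in this thread suggests paprica reads that annotation, so any abundance weighting of the resulting metabolic profile would have to be applied separately.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Collapse identical sequences (case-insensitive) and count them.

    Returns a list of (hypothetical_name, sequence, count) tuples,
    most abundant first.
    """
    counts = Counter()
    for _, seq in records:
        counts[seq.upper()] += 1
    return [
        (f"uniq{i}", seq, n)
        for i, (seq, n) in enumerate(counts.most_common(), start=1)
    ]


def write_fasta(uniques):
    """Format unique sequences as FASTA text with a ';size=N' annotation."""
    return "".join(f">{name};size={n}\n{seq}\n" for name, seq, n in uniques)


# Tiny in-memory example; for a real dataset, pass an open file handle
# to read_fasta instead of this list of lines.
example = """>read1
ACGT
>read2
acgt
>read3
GGCC
""".splitlines()

uniques = dereplicate(read_fasta(example))
print(write_fasta(uniques), end="")
```

The counts returned by `dereplicate` could then be used to scale whatever per-sequence output is produced for the unique set, which is the abundance step the question notes is missing.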