Running SNPsplit in parallel

Grishnahk commented 3 years ago

Dear Felix,

I am using SNPsplit with WGS data from an F1 mouse cell line (BL6/spretus) with a very high SNP density (41 million SNPs). Mapping using the N-masked genome works very well and X-linked reads map only to the maternal chromosome, which would be expected in this cross. Coverage is also quite even across chromosomes.

I now want to assign the reads overlapping SNPs to specific alleles and am using SNPsplit for this. These are deep WGS samples (60x, paired-end 150) and this was taking forever with one sample, so I extracted only the reads overlapping unfiltered SNPs in the UV-treated samples (~10k) and am now running SNPsplit on them to assign mutations to one genome for just this subset of the library. This is taking very long as well, and I am curious whether you have a method for parallelizing SNPsplit to speed up the process.

Of course it would be possible to split the reads again by chromosome, run each chromosome in a separate process and concatenate the results, but I was wondering if you had possibly developed an alternative method. I apologize in advance if I am missing something obvious and would greatly appreciate any input. Thanks!!

Cheers, Paul

FelixKrueger commented 3 years ago

Hi Paul,

I recently did a fairly extensive B6/Spretus comparison, and I can't recall it taking very long. Granted, Spretus is the mouse strain with the most SNPs, around twice as many as CAST, but I would still not expect this splitting to take all that long... Which time frame are we talking about? In everything I have ever looked at, I can't recall the process having taken more than a few hours, even though I have admittedly never dealt with 60x files in a Spretus cross.

Do you get sensible data out of this at the end, or are many reads not usable because of conflicting SNPs in the read pairs? A file with 10K reads should certainly not take more than a few seconds... Which version of SNPsplit are you using? Have you tried the latest dev version?

I am afraid SNPsplit doesn't really have any built-in parallelisation mechanism, but I suppose you should be able to split the mapped BAM file into several parts (make sure not to disrupt read pairs in this process), and then run SNPsplit on several of these subset BAM files in parallel. This should result in several files called allele_flagged.bam, which can then be concatenated into one merged allele-flagged BAM file again. You should then be able to feed this into tag2sort, which is a single-pass operation that will simply distribute the allele-flagged BAM file into genome1, genome2 and unassigned files.
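The split / run in parallel / concatenate / tag2sort workflow described above could be scripted roughly as follows. This is only a sketch under several assumptions, not a tested pipeline: `chunk_by_read_name`, `snpsplit_command`, `merge_command` and `run_parallel` are hypothetical helper names; the chunk BAM files are assumed to have already been written from name-grouped records so that mates stay together; `SNPsplit`, `tag2sort` and `samtools` are assumed to be on the PATH; and the `--paired`, `--snp_file` and `-o` options should be checked against `SNPsplit --help` and `tag2sort --help` for the installed version.

```python
import glob
import itertools
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def chunk_by_read_name(sam_lines, reads_per_chunk):
    """Yield lists of SAM records without ever splitting records sharing a QNAME.

    Assumes header lines (starting with '@') were removed and that the input is
    grouped by read name (e.g. name-sorted), so both mates of a pair are adjacent.
    """
    chunk, n_names = [], 0
    # QNAME is the first tab-separated column of a SAM record.
    for _, group in itertools.groupby(sam_lines, key=lambda l: l.split("\t", 1)[0]):
        if n_names == reads_per_chunk:
            yield chunk
            chunk, n_names = [], 0
        chunk.extend(group)
        n_names += 1
    if chunk:
        yield chunk


def snpsplit_command(snp_file, bam, out_dir):
    # Assumed invocation for paired-end data; verify flags with `SNPsplit --help`.
    return ["SNPsplit", "--paired", "--snp_file", snp_file, "-o", out_dir, bam]


def merge_command(flagged_bams, merged_bam):
    # `samtools cat` concatenates BAM files that share a compatible header.
    return ["samtools", "cat", "-o", merged_bam, *flagged_bams]


def run_parallel(snp_file, chunk_bams, out_dir, merged_bam, workers=4):
    """Run SNPsplit on each chunk concurrently, then merge and sort once."""
    os.makedirs(out_dir, exist_ok=True)
    commands = [snpsplit_command(snp_file, bam, out_dir) for bam in chunk_bams]
    # Threads suffice here: the heavy lifting happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    # Output names are discovered by pattern rather than assumed exactly.
    flagged = sorted(glob.glob(os.path.join(out_dir, "*allele_flagged.bam")))
    subprocess.run(merge_command(flagged, merged_bam), check=True)
    # Assumed tag2sort call on the merged file; verify with `tag2sort --help`.
    subprocess.run(["tag2sort", "--paired", merged_bam], check=True)
```

With this sketch, `merged_bam` should be written outside `out_dir` so that a rerun does not pick the merged file up again through the `*allele_flagged.bam` pattern. Each per-chunk SNPsplit run will also write its own genome1/genome2/unassigned files, which can be ignored if the merged file is passed through tag2sort.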

If you'd like to share some files with me, e.g. the 10K one, do let me know via email and I can set up an FTP server for you so that I can take a look myself.

FelixKrueger commented 3 years ago

I hope you got this sorted somehow.

LuJiansen commented 3 years ago

Hi Felix, I ran into the same problem as @Grishnahk when using SNPsplit on my data. However, when I removed the --verbose option, everything went well.
