FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

Question about analysis pipeline when using SNPsplit #31

Closed AIBio closed 4 years ago

AIBio commented 4 years ago

Hi~

Recently, I took over an analysis project in which the reads from ChIPseq are needed to be splited into paternal and maternal ones(C57BL6-NJ and PWK-PhJ strain mouse). I'm wondering whether my analysis pipleline is wrong. So, could you help me to check the whole process???

  1. filter and trim the reads before alignment

  2. prepare the N-masked genome: in this procedure, I found the SNPsplit_genome_preparation tools in SNPsplit can process one genome one by one by specifying the strain name. So, I firstly masked the mm10 genome with PWK-PhJ SNP information downloaded from sanger institute website (file name: mgp.v5.merged.snps_all.dbSNP142.vcf.gz), and then masked the fa file outputed by previous step with C57BL6-NJ SNP information. Finally, I merged the fa files in the output directory of last maskking step into the new genome masked with maternal and paternal SNP sites.

  3. align the reads to the new genome by bowtie2 with options [--end-to-end -N1 -q]

  4. remove duplicates by sambamba

  5. split the allele reads: in this step, I'm not sure how to input the snp file. I'm planning to merge the txt files from the step 2 in which SNPsplit seem to output the SNP information as txt files in each chromosome. But there are some questions: Can SNPsplit split the alignment into paternal and maternal ones at once? If true, how to provide the snp file with parameter --snp_file? or Should I run twice and split the maternal and paternal alignments seperately?

Can you help me to solve these problems???

Thank you sincerely!

Hanwen

FelixKrueger commented 4 years ago

Hi Hanwen,

To 1: fine

To 2: There should be no need to run two separate genome preparation processes, and merge them yourself. Instead, you can specify --dual_hybrid and then provide both strains at the same time. This would be the command on our machine:

SNPsplit_genome_preparation --vcf /bi/scratch/Genomes/Mouse/mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference /bi/scratch/Genomes/Mouse/GRCm38/ --strain C57BL_6NJ --strain2 PWK_PhJ

This would leave dual-hybrid indexes ready to be indexed, and also produce a SNP file that can be readily fed into SNPsplit at the splitting stage later on. The name should be something like: all_PWK_PhJ_SNPs_C57BL6NJ_reference.based_on_GRCm38.txt.gz.

In the splitting step, genome 1 would be the first strain, i.e. C57BL_6NJ, and genome2 would be PWK_PhJ

To 3: OK. (I think -N 1 is not needed but that's up to you)

To 4: I am also not sure if you want to remove duplicates from a ChIP-seq experiment straight away, as for highly enriched peaks this might take out the dynamic range of your signal. But this is again your choice.

To 5: If you would re-run step 2 you can use the file generated in with --snp all_PWK_PhJ_SNPs_C57BL6NJ_reference.based_on_GRCm38.txt.gz, and you should then straight away get your maternal and paternal specific reads.

I hope this will all work as expected. All the best, Felix

AIBio commented 4 years ago

Hi Felix,

Thank you for your useful reply. It's only good news for me after a frustrated and weary day.

I will follow your instructions to finish this pipeline.

If any problems happen, I will share my experience and personal solution on this issue to help the people who may need it.

Hanwen Yu

Thanks again, sincerely !

AIBio commented 4 years ago

Hi Felix,

As you recommendation, I run the SNPsplit_genome_preparation tools (command as below).

strain1=C57BL_6NJ strain2=PWK_PhJ vcf_file=/home/yuhanwen/genome/mgp/vcf/mus_musculus/mgp.v5.merged.snps_all.dbSNP142.vcf.gz ref_genome_dir=/home/yuhanwen/genome/ensembl/release97/mus_musculus/dna_sequence/ SNPsplit_genome_preparation --vcf_file ${vcf_file} --reference_genome ${ref_genome_dir} --dual_hybrid --strain ${strain1} --strain2 ${strain2}

and then I got the results you mentioned before. However, I'm confused which fa files in differenct directory should be indexed by aligner (bowtie2).

The directories includes "C57BL_6NJ_PWK_PhJ_dual_hybrid.based_on_GRCm38_N-masked" and "C57BL_6NJ_PWK_PhJ_dual_hybrid.based_on_GRCm38_full_sequence".

And sholud the file, "all_PWK_PhJ_SNPs_C57BL_6NJ_reference.based_on_GRCm38.txt", be used in the splitting step?

Hanwen Yu

FelixKrueger commented 4 years ago

For indexing you will need the files in the folder: C57BL_6NJ_PWK_PhJ_dual_hybrid.based_on_GRCm38_N-masked (since the splitting of file is based on alignments to the N-masked genomes).

And yes, the file all_PWK_PhJ_SNPs_C57BL_6NJ_reference.based_on_GRCm38.txt should be used as --snp_file for SNPsplit.

AIBio commented 4 years ago

Hi Felix, SNPsplit. what an amazing tool! I got the results from one of my samples (Oocyte from C57BL6-NJ strain female mouse named genome1 when masked the genome with 'N' ). The results show that the 48.64% mapped reads come from genome1 and only 0.44% reads from genome2 (PWK_PhJ strain male mouse). This results are consistent with the biological background of samples and prior knowledge. But I found 50.06% of mapped reads are unassignable. So, I have some question for this results:

  1. Is the proportion of unassignable reads normal? or What can I do to obtain more assigned reads and do not harm the precision of splitting reads into maternal and paternal ones.

  2. And should I improve the sequencing depth in order to get more reads after this step since around half of reads are unassignable and unusable for following analysis?

Finally, I post my code at splitting step and retuned results:

my code

###### >>>>>> 4. Split the reads into paternal or maternal ones
workdir=/home/yuhanwen/bioinfo/project-2cell-like/chipseq/gse71434
cat ${workdir}/metadata/gse71434_chipseq_srr_all.txt | awk '(NR >= 2){print $1}' | while read srr
do
    in_bam=${workdir}/results/sambamba/${srr}_psorted.bam
    out_dir=${workdir}/results/snpsplit
    snpfile=/home/yuhanwen/genome/mgp/vcf/mus_musculus/all_PWK_PhJ_SNPs_C57BL_6NJ_reference.based_on_GRCm38.txt
    log_file=${workdir}/logs/snpsplit/${srr}_split.log
    SNPsplit --snp_file ${snpfile} -o ${out_dir} ${in_bam} > ${log_file} 2>&1
done

report of splitting step

Input file: 'SRR2147241_psorted.bam' Writing allele-flagged output file to: 'SRR2147241_psorted.allele_flagged.bam'

Allele-tagging report

===================== Processed 20004392 read alignments in total Reads were unaligned and hence skipped: 0 (0.00%) 10013398 reads were unassignable (50.06%) 9730381 reads were specific for genome 1 (48.64%) 87852 reads were specific for genome 2 (0.44%) 127475 reads did not contain one of the expected bases at known SNP positions (0.64%) 172761 contained conflicting allele-specific SNPs (0.86%)

SNP coverage report

=================== SNP annotation file: /home/yuhanwen/genome/mgp/vcf/mus_musculus/all_PWK_PhJ_SNPs_C57BL_6NJ_reference.based_on_GRCm38.txt SNPs stored in total: 20280499 N-containing reads: 10106237 non-N: 9885923 total: 20004392 Reads had a deletion of the N-masked position (and were thus dropped): 12232 (0.06%) Of which had multiple deletions of N-masked positions within the same read: 15 Of valid N containing reads, N was present in the list of known SNPs: 19424217 (99.97%) N was not present in the list of SNPs: 5251 (0.03%)

FelixKrueger commented 4 years ago

I'm glad it worked in out hands!

The fraction of reads that are not assignable depends mainly on the following three factors:

  1. The number of SNPs between your two strains
  2. The read length used
  3. The library type, i.e. single-end or paired-end

To 1: PWK_PhJ seems to have a decent amount of SNPs compared to Black6 (C57BL_6NJ is virtually identical to the standard Black6 reference genome), where you would on average expect a read of ~100bp to cover at least one SNP. This will not always be the case, as there probably are regions with many more than one, and also regions of the genome that do not have any discrimanatory SNPs at all. Such regions will unfortunately remain undetectable irrespective of how deep you sequence.

To 2: As a general rule of thumb: the longer the reads, the higher the detection rate. 100, 125 or 150 bp should all be fairly good options.

To 3: In terms of allele-specific sorting, paired-end sequencing yields much higher assignment rates than single-end sequencing. This is simply because both R1 and R2 of a read pair can be used to assign a read pair, eventually increasing the chances of covering a SNP. For 2x125 bp sequencing in CASTxBlack6 hybrids, which have a very comparable number of SNP to PWK_PhJ, we reached some ~70% of reads that could be assigned to either genome, which is a fantastic rate.

So yea, if you wanted to get a bit more data to work with I would probably recommend re-sequencing the library with fairly long PE libraries (from your the report you linked you seem to have performed single-end sequencing only?).

Does that make sense?

AIBio commented 4 years ago

Hi Felix, Thank you for your clear explanation. But, I still have the last question: If I have Pair-end data, then should I have to use the --paired parameter in the spliting step? Hanwen

FelixKrueger commented 4 years ago

Yes, that's exactly what is required. I think I have recently added a single-end/paired-end auto-detection, as this caught my out myself for one or two runs, but adding --paired will certainly not hurt. All the best, Felix