Closed yeroslaviz closed 4 years ago
Hi @yeroslaviz
Generally, the difference between the N-masked genome and the full sequence is that the N-masked genome is required for SNPsplit to work. The full-sequence is more or less just required if you are planning to do something else with the strain sequence, here CAST. So with regard to SNPsplit you should only use the N-masked genome, a run against the full sequence genome should return absolutely nothing useful.
So I hope that aligning to the N-masked version of the genome will fix all your problems. (your command line looks fine otherwise).
Regarding the location of the SNPs (like in #24): I have used the SeqMonk software to import the SNPs as and determine the overlap with Exons (it's really only a few clicks).
Hi, thanks for the fast reply. It is still running so i can't say, But i have fund something else which I can't explain. I have mapped the sample with bowtie2 before I ran SNPsplit against the same genome (without the N-masking) with and without a duplication removal step (with picard). I than looked at the bam and bigwig/bedgraph files in IGV.
The region inside the red square was found only in the full-seq converted bam file. There are many reads here, but it seems to be completely missing the files mapped using bowtie2 in both the original and the de-duped bam.
Any idea what can cause this behavior?
[chr1 in IGV]
I will upload the results from the N-masked genome run as soon as it is done.
thanks again for the great tool
Assa
Hi Assa,
The screenshot you posted is highly puzzling. Really the only explanation I would have for this is that this could be a region with really tons and tons of SNPs, so that the alignments get rejected because of too many mismatches? I don't think this is really a SNPsplit issue itself though, or at least I couldn't think of anything else.
Could you also import the positions of the SNPs, or grep for that region in the SNP (all_SNPs_CAST...) to be sure?
two things -
one, I'm not familiar with Seqmonk, so I'm not really sure how to do it. I have uploaded the all_SNP file to Seqmonk as an annotation set (to where the gene/mRNAs are). But I don't seems to manage to overlap them. I have defined the probes using the feature probe generator
, than quantitated them and ran a Feature Filter
marking the allSNP and mRNA as the features to design around choosing to select probes that are overlapping.
But this didn't seem to work well.
two, I have converted the SNP file into bed and uploaded it to IGV together with the other files. As you an see, exactly at this position it is clear of SNPs all together.
Is it possible, that the reads were changed during the SNP detection in full_sequence mode so much, that they were matched to a different position in such a number?
To 1) There should be plenty of training material out there that explains the basics of SeqMonk, including the manual, videos on YouTube etc.
I would probably have imported the mapped file in BAM format, filtering out reads with a MAPQ < 20 to get rid of multi-mapped reads. Then import the SNPs as you did as annotation track. Then you could have done any type of running window probe generation (e.g. 10000 wide probes, genome wide, followed by a read count quantitation (or do the Wiggle Plot Pipeline, which does those two things in a single operation. Next, go to your region of interest and see the quantified number of reads, check if there are any SNPs in that region etc. Pretty much what you have now done in IGV already.
To 2) I'm not sure why you think that something happened to your reads at all, simply mapping with STAR shouldn't do anything at all. All that the full_sequence genome preparation does is to incorporate the CAST SNPs to the GRCm38 reference sequence. I really don't know what happened there, but the complete absence of SNPs in that region is normally a sign that this region might be highly repetitive, and therefore the alignments might not be very reliable. I have a few additional comments:
Did the N-masked alignments finish by now, and did you get better allele-splitting stats?
Hi thanks, I'll try the two points you mention.
The N-masked is still running. I'll report back when it is done.
Hi again,
I have another question - is it usually the case, that it runs for such a long time?
It is already running on the cluster for several days.
How long does it should usually take to finish a run with 44685408 reads?
thanks Assa
samtools flagstat SNPsplit/W36_Nmasked/W36_N-masekd.Aligned.out.sortedByName.bam
44685408 + 0 in total (QC-passed reads + QC-failed reads)
2257650 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
44685408 + 0 mapped (100.00% : N/A)
42427758 + 0 paired in sequencing
21213879 + 0 read1
21213879 + 0 read2
42427758 + 0 properly paired (100.00% : N/A)
42427758 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Hi Assa,
Are we talking about STAR alignments that are running for a long time? SNPsplit shouldn't take more than a few minutes, or at least well under one hour...
As far as I can see it, SNPsplit is running now.
on the STDOUT I can see
...
Now looking at the following read:
8 34119222 148M ATCCTGCAGTGACTGATCCGCAGGAATTGCAGATTTTCATAACGGAGTTGGTAAACCATAAAAACTTAAAAACAAAACAGCAAATCATGAGGTCCTAAGAGTCGAAGTACAGTAGTGAGGACGAAATGCTGACATCAAGATTTGCTGC 42N54N50
Read is a single continuous match: 148M
MD-tag: 42N54N50
42 54 50
element is 42
Start: 34119222 pos: 42
SNP was present: ref: C snp: G
pos: 42
ATCCTGCAGTGACTGATCCGCAGGAATTGCAGATTTTCATAACGGAGTTGGTAAACCATAAAAACTTAAAAACAAAACAGCAAATCATGAGGTCCTAAGAGTCGAAGTACAGTAGTGAGGACGAAATGCTGACATCAAGATTTGCTGC
base in read was: C
genome 1 specific!
now adjusting the position for the N itself
pos right now: 43
element is 54
Start: 34119222 pos: 97
SNP was present: ref: A snp: G
pos: 97
ATCCTGCAGTGACTGATCCGCAGGAATTGCAGATTTTCATAACGGAGTTGGTAAACCATAAAAACTTAAAAACAAAACAGCAAATCATGAGGTCCTAAGAGTCGAAGTACAGTAGTGAGGACGAAATGCTGACATCAAGATTTGCTGC
base in read was: A
genome 1 specific!
now adjusting the position for the N itself
pos right now: 98
~~~~~~~~~~~~~~~~~~~~
...
But this goes now for several days
I'm not sure, whether or not I should stop it and restart the command.
Ahhh, now that makes sense. It appears like you started SNPsplit with the --verbose
function. This produces lots of verbose output and comes with several sleep statements... Just stop it, re-run the same command but drop --verbose
. It should then finish within the hour.
Good luck!
Hi,
you were right. It is now done :-)
Now I need to understand the output. I find the number of unassigned reads quite high, but I don't have any comparisons.
Do you think it make sense to run the mapping with bowtie2 instead? Would it performs better than star without the soft-clipping?
I attach below the complete report.
thanks for all the help Assa
Input file: 'W36_N-masekd.Aligned.out.bam'
Writing allele-flagged output file to: 'W36_N-masekd.Aligned.out.allele_flagged.bam'
Allele-tagging report
=====================
Processed 44685408 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
20279202 reads were unassignable (45.38%)
20029556 reads were specific for genome 1 (44.82%)
4320474 reads were specific for genome 2 (9.67%)
24594 reads did not contain one of the expected bases at known SNP positions (0.06%)
56176 contained conflicting allele-specific SNPs (0.13%)
SNP coverage report
===================
SNP annotation file: /fs/pool/pool-bcfngs/genomes/Mmu.GrCm38/SNPsplit/all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total: 20668547
N-containing reads: 24411040
non-N: 20254608
total: 44685408
Reads had a deletion of the N-masked position (and were thus dropped): 19760 (0.04%)
Of which had multiple deletions of N-masked positions within the same read: 14
Of valid N containing reads,
N was present in the list of known SNPs: 52082500 (99.96%)
N was not present in the list of SNPs: 23418 (0.04%)
Input file: 'W36_N-masekd.Aligned.out.allele_flagged.bam'
Writing unassigned reads to: 'W36_N-masekd.Aligned.out.unassigned.bam'
Writing genome 1-specific reads to: 'W36_N-masekd.Aligned.out.genome1.bam'
Writing genome 2-specific reads to: 'W36_N-masekd.Aligned.out.genome2.bam'
Writing reads with conflicting number of SNPs to: 'W36_N-masekd.Aligned.out.conflicting.bam'
Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total: 22342704
thereof were read pairs: 22342704
thereof were singletons: 0
Reads were unassignable (not overlapping SNPs): 7377690 (33.02%)
thereof were read pairs: 7377690
thereof were singletons: 0
Reads were specific for genome 1: 12262448 (54.88%)
thereof were read pairs: 12262448
thereof were singletons: 0
Reads were specific for genome 2: 2642665 (11.83%)
thereof were read pairs: 2642665
thereof were singletons: 0
Reads contained conflicting SNP information: 59901 (0.27%)
thereof were read pairs: 59901
thereof were singletons: 0
Hi Assa,
now that looks a lot better! Just generally, the number of reads that can be assigned to different alleles depends on a number of things:
You have here close to 70% of assignable reads, which I find spectacularly good to be honest. The weird thing is that you've got ~55% specific for Black6, but only ~12% specific for CAST. This would argue that you are not looking at clean F1 hybrid cells with a 50/50 split, but probably either something like cells that had been re-crossed several times. Is that a possibility?
If you are not exactly sure about the genetic background of your samples you could also take a look at https://github.com/FelixKrueger/reStrainingOrder to find out (or possibly send me some 1-2M reads of your sample and I run it over here for you).
Regarding the Bowtie2 question: If your data is RNA-seq, you should use an RNA-seq aligner, like STAR or HISAT2. If your data is any other kind of DNA-seq application, Bowtie2 should be fine, too. In any case, if you are not using soft-clipping of reads you should use adapter trimming before running the alignments (e.g. Trim Galore, or equivalent). I believe in the latest dev version of SNPsplit I had a stab at supporting soft-clipped reads as well, so you might want to give that a go. Or now that you are an expert with this you could run STAR with and w/o soft-clipping, as well as Bowtie2, and then compare and contrast... :)
Hi Felix, Thanks for the elaborate answer. I also find the results better now :-).
In the experiment we used female C57BL/6J and male CAST/EIj. The genetic background is largely different between this two mice, and this combination is typical. I will use the tool you recommended to test the sample. But I would also appreciate it, if you can have a look at it.
About running reStrainingOrder - do I need to re-index the genome, or can it takes the same N-masked genome as I have used with SNPsplit? Can I re-run the tool with the sample bam file created for SNPsplit, or does it needs to be run on its own?
I will be happy to test-run the different possibilities for comparison purposes, but as I'm already working with the latest version (0.3.4), I can't find the dev version on the github page.
I will try though to run an adapter trimming beforehand and map the samples with and without soft-clipping. ( and maybe even with bowtie2 ;) ).
thanks
You can get the latest dev version using:
git clone git@github.com:FelixKrueger/SNPsplit.git
(using SSH) or
git clone https://github.com/FelixKrueger/SNPsplit.git
(using HTTPS)
For Restraining Order, you need to prepare a new N-masked reference, pretty much exactly following what is described here: https://github.com/FelixKrueger/reStrainingOrder/blob/master/Docs/README.md#running-restraining.
This also means that you cannot use the BAM files you have. And/or you send me a fe reads of your file and I run it over here.
Apparently I'm already using the dev version (Version: 0.3.4_dev_softer
).
But there are no actual parameters to assign when using soft-clipping.
I have tried though both options (with and w.o. soft-clipping), but the results are not substantially different. I have tried the following options
EndToEnd
Local
end-to-end
EndToEnd
Local
For option 1 i have already posted the results above, fir the other four, they are listed at the bottom.
In my opinion, all five runs show similar enough results, in all five the main bulk of reads map to genome 1 (BTW. reStrainingOrder
is still running), which might hint toward an unclean F1 mice. I'll let you know when I have more results.
My take from these five runs is, there is not a lot of difference between including soft-clipping data and excluding them (unless my run, exclude them automatically due to my parameters). Maybe you see something I don't.
thanks for all the help (and the patience).
2. STAR with `Local`
Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total: 24790993
thereof were read pairs: 24790993
thereof were singletons: 0
Reads were unassignable (not overlapping SNPs): 8513646 (34.34%)
thereof were read pairs: 8513646
thereof were singletons: 0
Reads were specific for genome 1: 13174254 (53.14%)
thereof were read pairs: 13174254
thereof were singletons: 0
Reads were specific for genome 2: 3018905 (12.18%)
thereof were read pairs: 3018905
thereof were singletons: 0
Reads contained conflicting SNP information: 84188 (0.34%)
thereof were read pairs: 84188
thereof were singletons: 0
3. bowie2 with `end-to-end`
Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total: 22342704
thereof were read pairs: 22342704
thereof were singletons: 0
Reads were unassignable (not overlapping SNPs): 7377690 (33.02%)
thereof were read pairs: 7377690
thereof were singletons: 0
Reads were specific for genome 1: 12262448 (54.88%)
thereof were read pairs: 12262448
thereof were singletons: 0
Reads were specific for genome 2: 2642665 (11.83%)
thereof were read pairs: 2642665
thereof were singletons: 0
Reads contained conflicting SNP information: 59901 (0.27%)
thereof were read pairs: 59901
thereof were singletons: 0
4. trim_galore -> STAR with `EndToEnd`
Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total: 27049611
thereof were read pairs: 27028167
thereof were singletons: 21444
Reads were unassignable (not overlapping SNPs): 9907265 (36.63%)
thereof were read pairs: 9898597
thereof were singletons: 8668
Reads were specific for genome 1: 13878199 (51.31%)
thereof were read pairs: 13868383
thereof were singletons: 9816
Reads were specific for genome 2: 3180510 (11.76%)
thereof were read pairs: 3177615
thereof were singletons: 2895
Reads contained conflicting SNP information: 83637 (0.31%)
thereof were read pairs: 83572
thereof were singletons: 65
5. trim_galore -> STAR with `Local`
Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total: 25934846
thereof were read pairs: 25913115
thereof were singletons: 21731
Reads were unassignable (not overlapping SNPs): 9380162 (36.17%)
thereof were read pairs: 9371439
thereof were singletons: 8723
Reads were specific for genome 1: 13572810 (52.33%)
thereof were read pairs: 13562787
thereof were singletons: 10023
Reads were specific for genome 2: 2916210 (11.24%)
thereof were read pairs: 2913292
thereof were singletons: 2918
Reads contained conflicting SNP information: 65664 (0.25%)
thereof were read pairs: 65597
thereof were singletons: 67
I find it encouraging that all these different approaches yield very comparable results. It is also good to see that the recently added soft-clipping support works! :) Maybe it is time for you to start the downstream analysis....
Hi again,
so reStrainingOrder
is finished. This was done with no errors, but somehow the file *.reStrainingOrder.summary_stats.txt
is empty, and therefore also no html file was created.
The command I used is as such
reStrainingOrder --snp_file /PATHtoFIle/MGPv5_SNP_matrix_chr1.txt.gz -o W36_star/ W36_star/W36_Nmasked_reStraining_StarAligned.out.bam
The report file is attached below. Again most of the reads ended up in the unassigned group. Did something went wrong here?
Looking at the top of the hybrid_score file
head W36_bowtie2/W36_Nmasked_reStraining_bowtie2.reStrainingOrder.hybrid_scores.txt
Potential Hybrid Agreeing Disagreeing Percentage agreement Strain1 Index Strain2 Index
C57BL_6NJ/CAST_EiJ 11703172 50109 99.57 11 15
C57BL_10J/CAST_EiJ 11699946 53335 99.55 10 15
C58_J/CAST_EiJ 11526521 226760 98.07 14 15
C57BL_6NJ/MOLF_EiJ 11512746 240535 97.95 11 24
C57BL_10J/MOLF_EiJ 11507805 245476 97.91 10 24
C57BL_6NJ/SPRET_EiJ 11498752 254529 97.83 11 32
I can't say that I find any impurities in sample. Might there be another reason for the high mapping to the one genome compare to the other?
+++Edit+++ I was just thinking about it - should there be a ~50/~50 separation between the two strains in this kind of analysis? would a 90/10 ratio hints toward a problem with F1?
thanks Assa
less W36_bowtie2/W36_Nmasked_reStraining_bowtie2.reStrainingOrder_report.txt
Input file: 'W36_Nmasked_reStraining_bowtie2.bam'
Writing pure strain compatibility scores to: 'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.strain_scores.txt'
Writing possible hybrid compatibility scores to: 'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.hybrid_scores.txt'
Writing hybrid allele ratio scores to: 'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.hybrid_ratios.txt'
Writing summary statistics file to: 'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.summary_stats.txt'
Allele-tagging report
=====================
Processed 38202326 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
35456540 reads were unassignable (92.81%)
2378555 reads were specific for genome 1 (6.23%)
23104 reads were specific for genome 2 (0.06%)
33461417 reads did not contain one of the expected bases at known SNP positions (87.59%)
344127 contained conflicting allele-specific SNPs (0.90%)
SNP coverage report
===================
SNP annotation file: /fs/pool/pool-bcfngs/genomes/Mmu.GrCm38/ReStraining/MGPv5_SNP_matrix_chr1.txt.gz
SNPs stored in total: 5506653
N-containing reads: 36066128
non-N: 1995123
total: 38202326
Reads had a deletion of the N-masked position (and were thus dropped): 141075 (0.37%)
Of which had multiple deletions of N-masked positions within the same read: 1264
Of valid N containing reads,
N was present in the list of known SNPs: 11753281 (7.68%)
N was not present in the list of SNPs: 141220329 (92.32%)
Hi Assa,
It is odd that your summary stats file is empty... I wonder why that is. Do you think it would be possible for you to send me some reads of your sample, e.g. as gzipped attachment to an email?
The important part here is though:
C57BL_6NJ/CAST_EiJ 99.57
Which means that sample is almost fully consistent with a Black6/CAST cross. You should then be able to look at the file called *hybrid_ratios.txt
to see the allelic split.
Something like this:
C57BL_6NJ 330 1.0 CAST_EiJ 34049 99.0
Hi,
here is a link to a subsample of the bam file mapped against the Nmasked genome converted using reStraining
. download here ( the file is approx. 440Mb)
The command used for that was
reStraining --vcf_file ../SNPsplit/mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference_genome GRCm38/
The sample was mapped using STAR
with the commands
STAR --runThreadN 10 --runMode genomeGenerate --genomeDir MGP_STARindexed/ --genomeFastaFiles *.fa
STAR --runThreadN 10 --genomeDir MGP_STARindexed --readFilesCommand zcat --readFilesIn WK36.conc.R1.fastq.gz WK36.conc.R2.fastq.gz --outFileNamePrefix W36_Nmasked_reStraining_Star --outSAMtype BAM Unsorted --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx --alignEndsType EndToEnd
# subsampling
samtools view -h -s 0.1 W36_Nmasked_reStraining_StarAligned.out.bam | samtools view -Sb > W36_Nmasked.subsample.bam
About your response above, I guess the important row in the results is this one:
C57BL_6NJ 3367269 87.9 CAST_EiJ 464901 12.1
So does it means that I don't have a lot of CAST in F1?
I am also not sure about the last three lines in the report file in combination with ~93% of unassigned reads.
Of valid N containing reads,
N was present in the list of known SNPs: 11753281 (7.68%)
N was not present in the list of SNPs: 141220329 (92.32%)
what does it means? The calculated ratios were done only with the ~8%of SNP which were in the SNP file? or with the ~8% of assigned reads? Or were all reads taken into account?
Hi Assa,
I think the results of reStrainingOrder suggest that your cell line is almost 100% compatible with a Black6 C57BL/CAST hybrid strain. And within this analysis, it looks like ~88% of positions that are annotated as variants between mouse strains are compatible with C57BL, with ~12% compatible with CAST.
These results are well in line with the SNPsplit results above, which suggested that ~11.8% of reads are CAST-specific (genome2).
The last three lines of the report mean the following:
reStrainingOrder
only uses SNPs on chromosome 1 to keep the memory footprint (vaguely) acceptable. It is the biggest of all mouse chromosomes, and is normally fine to get an idea about general trends. ~7.7% of reads seem to align to chromosome 1, and these are the reads taken into consideration for reStrainingOrder results. the other reads map elsewhere in the genome, and were not taken into consideration.
I am currently running reStrainingOrder over here as well to see it the reports get generated. I can then attach the HTML report here for your benefit.
Bottom Line:
I think the results from reStrainingOrder
reinforce the results by SNPsplit, and give you a bit more piece of mind that your cells are indeed C57BL/CAST. You do seem to have a non-clean F1 hybrid line, and now you can start the downstream anaylysis.
Maybe you want to test a sample of your parental mouse lines to see if they are really as pure-bred as you thought they were? :)
Here we go:
And here is the report and the HTML file: reStrainingOrder.zip
thanks a lot for everything. it is a bit clearer now
Hi, I'm running SNPsplit with the mouse genome GRCm38 and the strain CAST_EiJ.
The workflow was as such
My first question is what is the difference in the results between a
full_sequence
run and anN-masked
run?Second question is my main problem, as I get only unassigned reads.
my SNP file after filtering:
so ~20M SNP were founds and passed the threshold. But how can I check how many of them overlap an exon (like you did in #24)?
Can this be a problem of using the STAR aligner? Should I try
bowtie2
?Any ideas?
thanks