STAR provides only unassigned reads

FelixKrueger / SNPsplit

Allele-specific alignment sorting

http://felixkrueger.github.io/SNPsplit/

GNU General Public License v3.0

53 stars 20 forks source link

STAR provides only unassigned reads #34

Closed yeroslaviz closed 4 years ago

yeroslaviz commented 4 years ago

Hi, I'm running SNPsplit with the mouse genome GRCm38 and the strain CAST_EiJ.

The workflow was as such

# genome preparations (for comparison reason I tried both full_seqnece and N-masked)
SNPsplit_genome_preparation --strain CAST_EiJ  --vcf Mmu.GrCm38/SNPsplit/mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference_genome GRCm38/ --nmasking --full_sequence --genome_build GRCm38

#  genome indexing (command added only for full sequence, but is identical to N-masked run, onlt the directories differ).
STAR --runThreadN 10 --runMode genomeGenerate --genomeDir STARIndex_full_sequence/ --genomeFastaFiles CAST_EiJ_full_sequence/*.fa

# Mapping
STAR  --runThreadN 10 --genomeDir Mmu.GrCm38/SNPsplit/STARIndex_full_sequence --readFilesCommand zcat --readFilesIn IP_H3K4m3_WK36.conc.R1.fastq.gz IP_H3K4m3_WK36.conc.R2.fastq.gz --outFileNamePrefix SNPsplit/W36_full_sequence. --outSAMtype BAM Unsorted --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx  --alignEndsType EndToEnd

# SNPsplit ( Am I using the correct SNP file?)
SNPsplit --snp_file Mmu.GrCm38/SNPsplit/all_SNPs_CAST_EiJ_GRCm38.txt.gz --paired --conflicting --verbose SNPsplit/W36_full_sequence.Aligned.out.bam

My first question is what is the difference in the results between a full_sequence run and an N-masked run?

Second question is my main problem, as I get only unassigned reads.

my SNP file after filtering:

cat CAST_EiJ_SNP_filtering_report.txt
SNP position summary for strain CAST_EiJ (based on genome build GRCm38)
===========================================================================

Positions read in total:        78772544

21331153        SNP were homozygous. Of these:
20668547        SNP were homozygous and passed high confidence filters and were thus included into the CAST_EiJ genome

Not included into CAST_EiJ genome:
54636251        had the same sequence as the reference
4808            had no clearly defined alternative base
2800332         Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
662606          were homozygous but the filtering call was low confidence

Printed a single list of all SNPs to >all_SNPs_CAST_EiJ_GRCm38.txt.gz<...

so ~20M SNP were founds and passed the threshold. But how can I check how many of them overlap an exon (like you did in #24)?

Can this be a problem of using the STAR aligner? Should I try bowtie2?

Any ideas?

thanks

FelixKrueger commented 4 years ago

Hi @yeroslaviz

Generally, the difference between the N-masked genome and the full sequence is that the N-masked genome is required for SNPsplit to work. The full-sequence is more or less just required if you are planning to do something else with the strain sequence, here CAST. So with regard to SNPsplit you should only use the N-masked genome, a run against the full sequence genome should return absolutely nothing useful.

So I hope that aligning to the N-masked version of the genome will fix all your problems. (your command line looks fine otherwise).

Regarding the location of the SNPs (like in #24): I have used the SeqMonk software to import the SNPs as and determine the overlap with Exons (it's really only a few clicks).

yeroslaviz commented 4 years ago

Hi, thanks for the fast reply. It is still running so i can't say, But i have fund something else which I can't explain. I have mapped the sample with bowtie2 before I ran SNPsplit against the same genome (without the N-masking) with and without a duplication removal step (with picard). I than looked at the bam and bigwig/bedgraph files in IGV.

The region inside the red square was found only in the full-seq converted bam file. There are many reads here, but it seems to be completely missing the files mapped using bowtie2 in both the original and the de-duped bam.

Any idea what can cause this behavior?

[chr1 in IGV]

I will upload the results from the N-masked genome run as soon as it is done.

thanks again for the great tool

Assa

FelixKrueger commented 4 years ago

Hi Assa,

The screenshot you posted is highly puzzling. Really the only explanation I would have for this is that this could be a region with really tons and tons of SNPs, so that the alignments get rejected because of too many mismatches? I don't think this is really a SNPsplit issue itself though, or at least I couldn't think of anything else.

Could you also import the positions of the SNPs, or grep for that region in the SNP (all_SNPs_CAST...) to be sure?

yeroslaviz commented 4 years ago

two things -
one, I'm not familiar with Seqmonk, so I'm not really sure how to do it. I have uploaded the all_SNP file to Seqmonk as an annotation set (to where the gene/mRNAs are). But I don't seems to manage to overlap them. I have defined the probes using the feature probe generator, than quantitated them and ran a Feature Filter marking the allSNP and mRNA as the features to design around choosing to select probes that are overlapping. But this didn't seem to work well.

two, I have converted the SNP file into bed and uploaded it to IGV together with the other files. As you an see, exactly at this position it is clear of SNPs all together.

Is it possible, that the reads were changed during the SNP detection in full_sequence mode so much, that they were matched to a different position in such a number?

FelixKrueger commented 4 years ago

To 1) There should be plenty of training material out there that explains the basics of SeqMonk, including the manual, videos on YouTube etc.

I would probably have imported the mapped file in BAM format, filtering out reads with a MAPQ < 20 to get rid of multi-mapped reads. Then import the SNPs as you did as annotation track. Then you could have done any type of running window probe generation (e.g. 10000 wide probes, genome wide, followed by a read count quantitation (or do the Wiggle Plot Pipeline, which does those two things in a single operation. Next, go to your region of interest and see the quantified number of reads, check if there are any SNPs in that region etc. Pretty much what you have now done in IGV already.

To 2) I'm not sure why you think that something happened to your reads at all, simply mapping with STAR shouldn't do anything at all. All that the full_sequence genome preparation does is to incorporate the CAST SNPs to the GRCm38 reference sequence. I really don't know what happened there, but the complete absence of SNPs in that region is normally a sign that this region might be highly repetitive, and therefore the alignments might not be very reliable. I have a few additional comments:

I wouldn't waste further time with the full sequence genome, unless you are using pure CAST strains, and instead proceed to the N-masked genome alignments (as it doesn't make sense to use SNPsplit on the other file anyway)
maybe try to get rid of dodgy looking multi-mapping reads by filtering out reads with a MAPQ score of <20 and see if it all disappears.

Did the N-masked alignments finish by now, and did you get better allele-splitting stats?

yeroslaviz commented 4 years ago

Hi thanks, I'll try the two points you mention.

The N-masked is still running. I'll report back when it is done.

yeroslaviz commented 4 years ago

Hi again,

I have another question - is it usually the case, that it runs for such a long time?

It is already running on the cluster for several days.

How long does it should usually take to finish a run with 44685408 reads?

thanks Assa

samtools flagstat SNPsplit/W36_Nmasked/W36_N-masekd.Aligned.out.sortedByName.bam 
44685408 + 0 in total (QC-passed reads + QC-failed reads)
2257650 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
44685408 + 0 mapped (100.00% : N/A)
42427758 + 0 paired in sequencing
21213879 + 0 read1
21213879 + 0 read2
42427758 + 0 properly paired (100.00% : N/A)
42427758 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

FelixKrueger commented 4 years ago

Hi Assa,

Are we talking about STAR alignments that are running for a long time? SNPsplit shouldn't take more than a few minutes, or at least well under one hour...

yeroslaviz commented 4 years ago

As far as I can see it, SNPsplit is running now.

on the STDOUT I can see

...
Now looking at the following read:
8       34119222        148M    ATCCTGCAGTGACTGATCCGCAGGAATTGCAGATTTTCATAACGGAGTTGGTAAACCATAAAAACTTAAAAACAAAACAGCAAATCATGAGGTCCTAAGAGTCGAAGTACAGTAGTGAGGACGAAATGCTGACATCAAGATTTGCTGC    42N54N50
Read is a single continuous match: 148M
MD-tag: 42N54N50
42      54      50
element is 42
Start: 34119222 pos: 42
SNP was present: ref: C snp: G
pos: 42
ATCCTGCAGTGACTGATCCGCAGGAATTGCAGATTTTCATAACGGAGTTGGTAAACCATAAAAACTTAAAAACAAAACAGCAAATCATGAGGTCCTAAGAGTCGAAGTACAGTAGTGAGGACGAAATGCTGACATCAAGATTTGCTGC
base in read was: C
genome 1 specific!
now adjusting the position for the N itself
pos right now: 43
element is 54
Start: 34119222 pos: 97
SNP was present: ref: A snp: G
pos: 97
ATCCTGCAGTGACTGATCCGCAGGAATTGCAGATTTTCATAACGGAGTTGGTAAACCATAAAAACTTAAAAACAAAACAGCAAATCATGAGGTCCTAAGAGTCGAAGTACAGTAGTGAGGACGAAATGCTGACATCAAGATTTGCTGC
base in read was: A
genome 1 specific!
now adjusting the position for the N itself
pos right now: 98
~~~~~~~~~~~~~~~~~~~~
...

But this goes now for several days

I'm not sure, whether or not I should stop it and restart the command.

FelixKrueger commented 4 years ago

Ahhh, now that makes sense. It appears like you started SNPsplit with the --verbose function. This produces lots of verbose output and comes with several sleep statements... Just stop it, re-run the same command but drop --verbose. It should then finish within the hour.

Good luck!

yeroslaviz commented 4 years ago

Hi,

you were right. It is now done :-)

Now I need to understand the output. I find the number of unassigned reads quite high, but I don't have any comparisons.

Do you think it make sense to run the mapping with bowtie2 instead? Would it performs better than star without the soft-clipping?

I attach below the complete report.

thanks for all the help Assa

Input file:                                     'W36_N-masekd.Aligned.out.bam'
Writing allele-flagged output file to:          'W36_N-masekd.Aligned.out.allele_flagged.bam'

Allele-tagging report
=====================
Processed 44685408 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
20279202 reads were unassignable (45.38%)
20029556 reads were specific for genome 1 (44.82%)
4320474 reads were specific for genome 2 (9.67%)
24594 reads did not contain one of the expected bases at known SNP positions (0.06%)
56176 contained conflicting allele-specific SNPs (0.13%)

SNP coverage report
===================
SNP annotation file:    /fs/pool/pool-bcfngs/genomes/Mmu.GrCm38/SNPsplit/all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total:   20668547
N-containing reads:     24411040
non-N:                  20254608
total:                  44685408
Reads had a deletion of the N-masked position (and were thus dropped):  19760 (0.04%)
Of which had multiple deletions of N-masked positions within the same read:     14

Of valid N containing reads,
N was present in the list of known SNPs:        52082500 (99.96%)
N was not present in the list of SNPs:          23418 (0.04%)

Input file:                                             'W36_N-masekd.Aligned.out.allele_flagged.bam'
Writing unassigned reads to:                            'W36_N-masekd.Aligned.out.unassigned.bam'
Writing genome 1-specific reads to:                     'W36_N-masekd.Aligned.out.genome1.bam'
Writing genome 2-specific reads to:                     'W36_N-masekd.Aligned.out.genome2.bam'
Writing reads with conflicting number of SNPs to:       'W36_N-masekd.Aligned.out.conflicting.bam'

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:               22342704
        thereof were read pairs:                        22342704
        thereof were singletons:                        0
Reads were unassignable (not overlapping SNPs):         7377690 (33.02%)
        thereof were read pairs:        7377690
        thereof were singletons:        0
Reads were specific for genome 1:                       12262448 (54.88%)
        thereof were read pairs:        12262448
        thereof were singletons:        0
Reads were specific for genome 2:                       2642665 (11.83%)
        thereof were read pairs:        2642665
        thereof were singletons:        0
Reads contained conflicting SNP information:            59901 (0.27%)
        thereof were read pairs:        59901
        thereof were singletons:        0

FelixKrueger commented 4 years ago

Hi Assa,

now that looks a lot better! Just generally, the number of reads that can be assigned to different alleles depends on a number of things:

the read length
single-end/paired-end
the number of SNPs
the type of sequencing

You have here close to 70% of assignable reads, which I find spectacularly good to be honest. The weird thing is that you've got ~55% specific for Black6, but only ~12% specific for CAST. This would argue that you are not looking at clean F1 hybrid cells with a 50/50 split, but probably either something like cells that had been re-crossed several times. Is that a possibility?

If you are not exactly sure about the genetic background of your samples you could also take a look at https://github.com/FelixKrueger/reStrainingOrder to find out (or possibly send me some 1-2M reads of your sample and I run it over here for you).

Regarding the Bowtie2 question: If your data is RNA-seq, you should use an RNA-seq aligner, like STAR or HISAT2. If your data is any other kind of DNA-seq application, Bowtie2 should be fine, too. In any case, if you are not using soft-clipping of reads you should use adapter trimming before running the alignments (e.g. Trim Galore, or equivalent). I believe in the latest dev version of SNPsplit I had a stab at supporting soft-clipped reads as well, so you might want to give that a go. Or now that you are an expert with this you could run STAR with and w/o soft-clipping, as well as Bowtie2, and then compare and contrast... :)

yeroslaviz commented 4 years ago

Hi Felix, Thanks for the elaborate answer. I also find the results better now :-).

In the experiment we used female C57BL/6J and male CAST/EIj. The genetic background is largely different between this two mice, and this combination is typical. I will use the tool you recommended to test the sample. But I would also appreciate it, if you can have a look at it.

About running reStrainingOrder - do I need to re-index the genome, or can it takes the same N-masked genome as I have used with SNPsplit? Can I re-run the tool with the sample bam file created for SNPsplit, or does it needs to be run on its own?

I will be happy to test-run the different possibilities for comparison purposes, but as I'm already working with the latest version (0.3.4), I can't find the dev version on the github page.

I will try though to run an adapter trimming beforehand and map the samples with and without soft-clipping. ( and maybe even with bowtie2 ;) ).

thanks

FelixKrueger commented 4 years ago

You can get the latest dev version using: git clone git@github.com:FelixKrueger/SNPsplit.git (using SSH) or git clone https://github.com/FelixKrueger/SNPsplit.git (using HTTPS)

For Restraining Order, you need to prepare a new N-masked reference, pretty much exactly following what is described here: https://github.com/FelixKrueger/reStrainingOrder/blob/master/Docs/README.md#running-restraining.

This also means that you cannot use the BAM files you have. And/or you send me a fe reads of your file and I run it over here.

yeroslaviz commented 4 years ago

Apparently I'm already using the dev version (Version: 0.3.4_dev_softer). But there are no actual parameters to assign when using soft-clipping.

I have tried though both options (with and w.o. soft-clipping), but the results are not substantially different. I have tried the following options

STAR with EndToEnd
STAR with Local
bowie2 with end-to-end
trim_galore -> STAR with EndToEnd
trim_galore -> STAR with Local

For option 1 i have already posted the results above, fir the other four, they are listed at the bottom.

In my opinion, all five runs show similar enough results, in all five the main bulk of reads map to genome 1 (BTW. reStrainingOrder is still running), which might hint toward an unclean F1 mice. I'll let you know when I have more results.

My take from these five runs is, there is not a lot of difference between including soft-clipping data and excluding them (unless my run, exclude them automatically due to my parameters). Maybe you see something I don't.

thanks for all the help (and the patience).

2. STAR with `Local`

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:               24790993
        thereof were read pairs:                        24790993
        thereof were singletons:                        0
Reads were unassignable (not overlapping SNPs):         8513646 (34.34%)
        thereof were read pairs:        8513646
        thereof were singletons:        0
Reads were specific for genome 1:                       13174254 (53.14%)
        thereof were read pairs:        13174254
        thereof were singletons:        0
Reads were specific for genome 2:                       3018905 (12.18%)
        thereof were read pairs:        3018905
        thereof were singletons:        0
Reads contained conflicting SNP information:            84188 (0.34%)
        thereof were read pairs:        84188
        thereof were singletons:        0

3. bowie2 with `end-to-end`

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:               22342704
        thereof were read pairs:                        22342704
        thereof were singletons:                        0
Reads were unassignable (not overlapping SNPs):         7377690 (33.02%)
        thereof were read pairs:        7377690
        thereof were singletons:        0
Reads were specific for genome 1:                       12262448 (54.88%)
        thereof were read pairs:        12262448
        thereof were singletons:        0
Reads were specific for genome 2:                       2642665 (11.83%)
        thereof were read pairs:        2642665
        thereof were singletons:        0
Reads contained conflicting SNP information:            59901 (0.27%)
        thereof were read pairs:        59901
        thereof were singletons:        0

4. trim_galore -> STAR with `EndToEnd`

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:               27049611
        thereof were read pairs:                        27028167
        thereof were singletons:                        21444
Reads were unassignable (not overlapping SNPs):         9907265 (36.63%)
        thereof were read pairs:        9898597
        thereof were singletons:        8668
Reads were specific for genome 1:                       13878199 (51.31%)
        thereof were read pairs:        13868383
        thereof were singletons:        9816
Reads were specific for genome 2:                       3180510 (11.76%)
        thereof were read pairs:        3177615
        thereof were singletons:        2895
Reads contained conflicting SNP information:            83637 (0.31%)
        thereof were read pairs:        83572
        thereof were singletons:        65

5. trim_galore -> STAR with `Local`

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:               25934846
        thereof were read pairs:                        25913115
        thereof were singletons:                        21731
Reads were unassignable (not overlapping SNPs):         9380162 (36.17%)
        thereof were read pairs:        9371439
        thereof were singletons:        8723
Reads were specific for genome 1:                       13572810 (52.33%)
        thereof were read pairs:        13562787
        thereof were singletons:        10023
Reads were specific for genome 2:                       2916210 (11.24%)
        thereof were read pairs:        2913292
        thereof were singletons:        2918
Reads contained conflicting SNP information:            65664 (0.25%)
        thereof were read pairs:        65597
        thereof were singletons:        67

FelixKrueger commented 4 years ago

I find it encouraging that all these different approaches yield very comparable results. It is also good to see that the recently added soft-clipping support works! :) Maybe it is time for you to start the downstream analysis....

yeroslaviz commented 4 years ago

Hi again, so reStrainingOrder is finished. This was done with no errors, but somehow the file *.reStrainingOrder.summary_stats.txt is empty, and therefore also no html file was created.

The command I used is as such

 reStrainingOrder --snp_file /PATHtoFIle/MGPv5_SNP_matrix_chr1.txt.gz -o W36_star/ W36_star/W36_Nmasked_reStraining_StarAligned.out.bam

The report file is attached below. Again most of the reads ended up in the unassigned group. Did something went wrong here?

Looking at the top of the hybrid_score file

 head W36_bowtie2/W36_Nmasked_reStraining_bowtie2.reStrainingOrder.hybrid_scores.txt
Potential Hybrid        Agreeing        Disagreeing     Percentage agreement    Strain1 Index   Strain2 Index
C57BL_6NJ/CAST_EiJ      11703172        50109   99.57   11      15
C57BL_10J/CAST_EiJ      11699946        53335   99.55   10      15
C58_J/CAST_EiJ  11526521        226760  98.07   14      15
C57BL_6NJ/MOLF_EiJ      11512746        240535  97.95   11      24
C57BL_10J/MOLF_EiJ      11507805        245476  97.91   10      24
C57BL_6NJ/SPRET_EiJ     11498752        254529  97.83   11      32

I can't say that I find any impurities in sample. Might there be another reason for the high mapping to the one genome compare to the other?

+++Edit+++ I was just thinking about it - should there be a ~50/~50 separation between the two strains in this kind of analysis? would a 90/10 ratio hints toward a problem with F1?

thanks Assa

less W36_bowtie2/W36_Nmasked_reStraining_bowtie2.reStrainingOrder_report.txt
Input file:                                             'W36_Nmasked_reStraining_bowtie2.bam'
Writing pure strain compatibility scores to:            'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.strain_scores.txt'
Writing possible hybrid compatibility scores to:        'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.hybrid_scores.txt'
Writing hybrid allele ratio scores to:                  'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.hybrid_ratios.txt'
Writing summary statistics file to:                     'W36_Nmasked_reStraining_bowtie2.reStrainingOrder.summary_stats.txt'

Allele-tagging report
=====================
Processed 38202326 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
35456540 reads were unassignable (92.81%)
2378555 reads were specific for genome 1 (6.23%)
23104 reads were specific for genome 2 (0.06%)
33461417 reads did not contain one of the expected bases at known SNP positions (87.59%)
344127 contained conflicting allele-specific SNPs (0.90%)

SNP coverage report
===================
SNP annotation file:    /fs/pool/pool-bcfngs/genomes/Mmu.GrCm38/ReStraining/MGPv5_SNP_matrix_chr1.txt.gz
SNPs stored in total:   5506653
N-containing reads:     36066128
non-N:                  1995123
total:                  38202326
Reads had a deletion of the N-masked position (and were thus dropped):  141075 (0.37%)
Of which had multiple deletions of N-masked positions within the same read:     1264

Of valid N containing reads,
N was present in the list of known SNPs:        11753281 (7.68%)
N was not present in the list of SNPs:          141220329 (92.32%)

FelixKrueger commented 4 years ago

Hi Assa,

It is odd that your summary stats file is empty... I wonder why that is. Do you think it would be possible for you to send me some reads of your sample, e.g. as gzipped attachment to an email?

The important part here is though: C57BL_6NJ/CAST_EiJ 99.57

Which means that sample is almost fully consistent with a Black6/CAST cross. You should then be able to look at the file called *hybrid_ratios.txt to see the allelic split.

Something like this: C57BL_6NJ 330 1.0 CAST_EiJ 34049 99.0

yeroslaviz commented 4 years ago

Hi,

here is a link to a subsample of the bam file mapped against the Nmasked genome converted using reStraining. download here ( the file is approx. 440Mb)

The command used for that was

reStraining --vcf_file ../SNPsplit/mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference_genome GRCm38/

The sample was mapped using STAR with the commands

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir MGP_STARindexed/ --genomeFastaFiles *.fa
STAR  --runThreadN 10 --genomeDir MGP_STARindexed --readFilesCommand zcat --readFilesIn WK36.conc.R1.fastq.gz WK36.conc.R2.fastq.gz --outFileNamePrefix W36_Nmasked_reStraining_Star --outSAMtype BAM Unsorted --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx  --alignEndsType EndToEnd
# subsampling
samtools view -h -s 0.1 W36_Nmasked_reStraining_StarAligned.out.bam | samtools view -Sb > W36_Nmasked.subsample.bam

About your response above, I guess the important row in the results is this one:

C57BL_6NJ 3367269 87.9 CAST_EiJ 464901 12.1

So does it means that I don't have a lot of CAST in F1?

I am also not sure about the last three lines in the report file in combination with ~93% of unassigned reads.

Of valid N containing reads,
N was present in the list of known SNPs:        11753281 (7.68%)
N was not present in the list of SNPs:          141220329 (92.32%)

what does it means? The calculated ratios were done only with the ~8%of SNP which were in the SNP file? or with the ~8% of assigned reads? Or were all reads taken into account?

FelixKrueger commented 4 years ago

Hi Assa,

I think the results of reStrainingOrder suggest that your cell line is almost 100% compatible with a Black6 C57BL/CAST hybrid strain. And within this analysis, it looks like ~88% of positions that are annotated as variants between mouse strains are compatible with C57BL, with ~12% compatible with CAST.

These results are well in line with the SNPsplit results above, which suggested that ~11.8% of reads are CAST-specific (genome2).

The last three lines of the report mean the following: reStrainingOrder only uses SNPs on chromosome 1 to keep the memory footprint (vaguely) acceptable. It is the biggest of all mouse chromosomes, and is normally fine to get an idea about general trends. ~7.7% of reads seem to align to chromosome 1, and these are the reads taken into consideration for reStrainingOrder results. the other reads map elsewhere in the genome, and were not taken into consideration.

I am currently running reStrainingOrder over here as well to see it the reports get generated. I can then attach the HTML report here for your benefit.

Bottom Line: I think the results from reStrainingOrder reinforce the results by SNPsplit, and give you a bit more piece of mind that your cells are indeed C57BL/CAST. You do seem to have a non-clean F1 hybrid line, and now you can start the downstream anaylysis.

Maybe you want to test a sample of your parental mouse lines to see if they are really as pure-bred as you thought they were? :)

FelixKrueger commented 4 years ago

Here we go:

And here is the report and the HTML file: reStrainingOrder.zip

yeroslaviz commented 4 years ago

thanks a lot for everything. it is a bit clearer now