COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Inconsistent library type #137

Open ckr123tw opened 7 years ago

ckr123tw commented 7 years ago

Hi, I tried to let Salmon infer the library type automatically on TCGA data. However, it found a high percentage of reads (nearly 40%) inconsistent with the inferred library type. Did I make a mistake in the processing steps? Are the results still valid? Thanks.

Command line:
samtools bam2fq 59cd4694-4001-483b-9a30-33a166943bd1_gdc_realn_rehead.bam > 59cd4694-4001-483b-9a30-33a166943bd1.fastq
cat 59cd4694-4001-483b-9a30-33a166943bd1.fastq|grep '^@./1$' -A 3 --no-group-separator > 59cd4694-4001-483b-9a30-33a166943bd1_R1.fastq
cat 59cd4694-4001-483b-9a30-33a166943bd1.fastq|grep '^@./2$' -A 3 --no-group-separator > 59cd4694-4001-483b-9a30-33a166943bd1_R2.fastq
salmon quant -i ~/hg38_ref/GCA_000001405.15_GRCh38_no_alt -l A -1 59cd4694-4001-483b-9a30-33a166943bd1_R1.fastq.gz -2 59cd4694-4001-483b-9a30-33a166943bd1_R2.fastq.gz -p 6 -o ./SalmonQuant

lib_format_counts.json:
"read_files": "( 59cd4694-4001-483b-9a30-33a166943bd1_R1.fastq.gz, 59cd4694-4001-483b-9a30-33a166943bd1_R2.fastq.gz )",
"expected_format": "MU",
"compatible_fragment_ratio": 0.6136629002400855,
"num_compatible_fragments": 27796954,
"num_assigned_fragments": 45296781,
"num_consistent_mappings": 115802038,
"num_inconsistent_mappings": 95238168,
"strand_mapping_bias": 0.5013267642146333,
"MSF": 0,
"OSF": 544092,
"ISF": 39968962,
"MSR": 0,
"OSR": 556020,
"ISR": 39905960,
"SF": 7149723,
"SR": 7113411,
"MU": 0,
"OU": 0,
"IU": 0,
"U": 0

bounlu commented 6 years ago

I usually get IU for auto-detected library type for TCGA samples.

rob-p commented 6 years ago

Yup, and the fact that this ended up as MU is strange, since the library type frequencies clearly suggest IU (the ISF and ISR counts dominate). Could it be the result of generating the FASTQ files by converting from BAM, with some sort of bias in the initial reads? The automatic detection uses the first 10,000 reads to decide; if these are mapped in a biased way, that could be the cause.
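
To make the "ISF and ISR dominate" observation concrete, here is a minimal Python sketch that tallies the per-type counts from the lib_format_counts.json posted above. The helper name `summarize_counts` and the returned keys are made up for illustration; this is not how Salmon itself performs automatic detection, just arithmetic on the reported numbers.

```python
# Mapping counts copied from the lib_format_counts.json posted in this thread.
counts = {
    "MSF": 0, "OSF": 544092, "ISF": 39968962,
    "MSR": 0, "OSR": 556020, "ISR": 39905960,
    "SF": 7149723, "SR": 7113411,
    "MU": 0, "OU": 0, "IU": 0, "U": 0,
}


def summarize_counts(counts):
    """Hypothetical helper: orientation and strand balance of the counts."""
    total = sum(counts.values())
    inward = counts["ISF"] + counts["ISR"]  # inward-facing pairs, either strand
    return {
        "total": total,
        # Share of all counted mappings that are inward-oriented.
        "inward_fraction": inward / total,
        # Close to 0.5 means no strand preference among inward mappings.
        "forward_share_of_inward": counts["ISF"] / inward,
    }


summary = summarize_counts(counts)
for key, value in summary.items():
    print(key, value)
```

On these numbers, inward orientations account for roughly 84% of the counted mappings and split almost evenly between ISF and ISR, which is the pattern one would expect from an unstranded inward (IU) library rather than MU.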