BoevaLab / FREEC

Control-FREEC: Copy number and genotype annotation in whole genome and whole exome sequencing data
152 stars 49 forks source link

FREEC was not able to extract reads #2

Closed ramaniak closed 8 years ago

ramaniak commented 8 years ago

Hello, I am trying to use FREEC on a paired (Tum/Norm) WGS data set of Illumina paired end bam files. I keep getting the following error

finished reading T62.sorted.dup.recal.bam
PROFILING [tid=46912512328960]:  T62.sorted.dup.recal.bam read in 8929 seconds [fillMyHash]
990013582 lines read..
0 reads used to compute copy number profile
Error: FREEC was not able to extract reads from T62.sorted.dup.recal.bam

Check your parameters: inputFormat and matesOrientation
Use "matesOrientation=0" if you have single end reads
Check the list of possible input formats at http://bioinfo-out.curie.fr/projects/freec/tutorial.html#CONFIG

If you use sorted SAM or BAM, please set "mateOrientation=0"; then FREEC will not try to detect pairs with normal orientation and insert size. Instead, it will keep all pairs from the input file

I have repeated the analysis with all possible input options for matesOrientation (FR, RF and 0) and get the same error. I am using BAM for my inputFormat and I am using Control-FREEC v9.3. Any help would be much appreciated.

Thanks Arun

valeu commented 8 years ago

Most likely chromosome names in your BAM file do not correspond to chromosome names in your chr_len file. Please, check.

ramaniak commented 8 years ago

I checked and that does not seem to be the issue

bam header

@SQ SN:1      LN:249250621
@SQ SN:2      LN:243199373
@SQ SN:3      LN:198022430
@SQ SN:4      LN:191154276
.
.
.

fai file

1       249250621       52      60      61
2       243199373       253404903       60      61
3       198022430       500657651       60      61
4       191154276       701980507       60      61
5       180915260       896320740       60      61
.
.
valeu commented 8 years ago

Actually, FREEC does not understand the fa.fai without the "chr" prefixes. Just add them to the fai file and it should work. It is OK not to have "chr" in the BAMs

ramaniak commented 8 years ago

I see. So, FREEC would be fine if the "chr"ids did not match? Cool, thanks Arun

On Mon, Aug 8, 2016 at 10:50 AM, Valentina Boeva notifications@github.com wrote:

Actually, FREEC does not understand the fa.fai without the "chr" prefixes. Just add them to the fai file and it should work. It is OK not to have "chr" in the BAMs

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/BoevaLab/FREEC/issues/2#issuecomment-238261681, or mute the thread https://github.com/notifications/unsubscribe-auth/AFzoBpUWh15MHbPiR2-6QC21ii98pjf-ks5qd0IrgaJpZM4Jeo6g .

valeu commented 8 years ago

Yes, FREEC removes "chr" when it reads it. But it needs it in the fa.fai to make a difference between:

this: 1 1 249250621
2 2 243199373
3 3 198022430
4 4 191154276
5 5 180915260

this: 1 chr1 249250621
2 chr2 243199373
3 chr3 198022430
4 chr4 191154276
5 chr5 180915260

and this:

chr1 249250621 52 60 61 chr2 243199373 253404903 60 61 chr3 198022430 500657651 60 61 chr4 191154276 701980507 60 61 chr5 180915260 896320740 60 61

yjx1217 commented 7 years ago

Hello, I've encountered the same error but I've already add "chr" in my refseq.fa.fai file. Could you help to check what might cause the problem? Thanks!

Control-FREEC v10.6 : a method for automatic detection of copy number alterations, subclones and for accurate estimation of contamination and main ploidy using deep-sequencing data Multi-threading mode using 10 threads ..Breakpoint threshold for segmentation of copy number profiles is 0.8 ..telocenromeric set to 7000 ..FREEC is not going to output normalized copy number profiles into a BedGraph file (for example, for visualization in the UCSC GB). Use "[general] BedGraphOutput=TRUE" if you want a BedGraph file ..FREEC is not going to adjust profiles for a possible contamination by normal cells ..Window = 250 was set ..Output directory: FREEC_out ..Directory with files containing chromosome sequences: /home/jxyue/Projects/Suppressor_Mutant/SGDref_Chr/ ..Sample file: TSQ1310.merged.rdgrp.bam ..Sample input format: BAM ..will use this instance of samtools: 'samtools' to read BAM files ..minimal expected GC-content (general parameter "minExpectedGC") was set to 0.35 ..maximal expected GC-content (general parameter "maxExpectedGC") was set to 0.55 ..Polynomial degree for "ReadCount ~ GC-content" normalization is 3 or 4: will try both ..Minimal CNA length (in windows) is 1 ..File with chromosome lengths: /home/jxyue/Projects/Suppressor_Mutant/refseq.fa.fai ..Using the default minimal mappability value of 0.85 ..uniqueMatch = FALSE ..average ploidy set to 1 ..break-point type set to 4 ..noisyData set to 0 ..Control-FREEC will not look for subclones ..File /home/jxyue/Projects/Suppressor_Mutant/refseq.fa.fai was read ..Starting reading TSQ1310.merged.rdgrp.bam ..samtools should be installed to be able to read BAM files; will use the following command for samtools: samtools view TSQ1310.merged.rdgrp.bam ..finished reading TSQ1310.merged.rdgrp.bam PROFILING [tid=140541144065856]: TSQ1310.merged.rdgrp.bam read in 17 seconds [fillMyHash] 5454812 lines read.. 0 reads used to compute copy number profile

Error: FREEC was not able to extract reads from TSQ1310.merged.rdgrp.bam

Check your parameters: inputFormat and matesOrientation Use "matesOrientation=0" if you have single end reads Check the list of possible input formats at http://bioinfo-out.curie.fr/projects/freec/tutorial.html#CONFIG

If you use sorted SAM or BAM, please set "mateOrientation=0"; then FREEC will not try to detect pairs with normal orientation and insert size. Instead, it will keep all pairs from the input file

In my config file, I've already set: inputFormat = BAM mateOrientation=0 # use "mateOrientation=0" for sorted .SAM and .BAM

My refseq.fa.fai file goes like this:

chrI 230218 6 100 101 chrII 813184 232534 100 101 chrIII 316620 1053858 100 101 chrIV 1531933 1373652 100 101 chrV 576874 2920911 100 101 chrVI 270161 3503561 100 101 chrVII 1090940 3776432 100 101 chrVIII 562643 4878291 100 101 chrIX 439888 5446568 100 101 chrX 745751 5890861 100 101 chrXI 666816 6644077 100 101 chrXII 1078177 7317570 100 101 chrXIII 924431 8406538 100 101 chrXIV 784333 9340222 100 101 chrXV 1091291 10132406 100 101 chrXVI 948066 11234618 100 101 chrmt 85779 12192172 100 101 chrS288c_plasmid 6318 12278824 100 101

And the header of my input bam file goes like this: @HD VN:1.5 GO:none SO:coordinate @SQ SN:chrI LN:230218 @SQ SN:chrII LN:813184 @SQ SN:chrIII LN:316620 @SQ SN:chrIV LN:1531933 @SQ SN:chrV LN:576874 @SQ SN:chrVI LN:270161 @SQ SN:chrVII LN:1090940 @SQ SN:chrVIII LN:562643 @SQ SN:chrIX LN:439888 @SQ SN:chrX LN:745751 @SQ SN:chrXI LN:666816 @SQ SN:chrXII LN:1078177 @SQ SN:chrXIII LN:924431 @SQ SN:chrXIV LN:784333 @SQ SN:chrXV LN:1091291 @SQ SN:chrXVI LN:948066 @SQ SN:chrmt LN:85779 @SQ SN:S288c_plasmid LN:6318 @RG ID:TSQ1310.merged LB:TSQ1310.merged PL:Illumina SM:TSQ1310 PU:TSQ1310.merged CN:Sanger @PG ID:GATK IndelRealigner VN:3.7-0-gcfedb67 CL:knownAlleles=[] targetIntervals=21836_2_96_2.SGDref.realn.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:GATK IndelRealigner.5 VN:3.7-0-gcfedb67 CL:knownAlleles=[] targetIntervals=21836_3_96_3.SGDref.realn.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates VN:2.8.0-SNAPSHOT CL:picard.sam.markduplicates.MarkDuplicates INPUT=[21836_2_96_2.SGDref.rdgrp.bam] OUTPUT=21836_2_96_2.SGDref.dedup.bam METRICS_FILE=21836_2_96_2.SGDref.dedup.matrics REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates @PG ID:MarkDuplicates.1 VN:2.8.0-SNAPSHOT CL:picard.sam.markduplicates.MarkDuplicates INPUT=[21836_3_96_3.SGDref.rdgrp.bam] OUTPUT=21836_3_96_3.SGDref.dedup.bam METRICS_FILE=21836_3_96_3.SGDref.dedup.matrics REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates @PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:/home/jxyue/Programs/bwa/bwa mem -t 4 -M refseq.fa 21836_2_96_1_trim.fq.gz 21836_2_96_2_trim.fq.gz @PG ID:bwa.3 PN:bwa VN:0.7.15-r1140 CL:/home/jxyue/Programs/bwa/bwa mem -t 4 -M refseq.fa 21836_3_96_1_trim.fq.gz 21836_3_96_2_trim.fq.gz

Best, Jia-Xing

valeu commented 7 years ago

The problem was that the line of mateOrientation contains a comment starting with "#". So FREEC assumed that the orientation was "unknown". So it discarded all reads. The solution was to use: mateOrientation=0

This bug will be fixed in v10.10

yjx1217 commented 7 years ago

Hi Valentina,

Thank you very much! It works!

Best, Jia-Xing

On Thu, Jul 6, 2017 at 6:01 PM, Valentina Boeva notifications@github.com wrote:

The problem was that the line of mateOrientation contains a comment starting with "#". So FREEC assumed that the orientation was "unknown". So it discarded all reads. The solution was to use: mateOrientation=0

This bug will be fixed in v10.10

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BoevaLab/FREEC/issues/2#issuecomment-313440667, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq19rRq0SRXmy34tTjD4UXPbXzh_xiZks5sLQTTgaJpZM4Jeo6g .

-- Jia-Xing Yue

Population Genomics and Complex Traits Group Tour Pasteur 8eme etage Faculté de Médecine Institute for Research on Cancer and Aging, Nice (IRCAN) CNRS UMR 7284 - INSERM U 1081 - Université Côte d’Azur (UCA) 28 Avenue de Valombrose 06107 NICE Cedex 2 France

Personal website: http://www.iamphioxus.org/

yjx1217 commented 7 years ago

Hi Valentina,

After the FREEC run finished, I further run the assess_significance.R script as demonstrated in the manual: cat assess_significance.R | R --slave --args input_CNVs input_ratio.txt

I got the following warnings (highlighted in blue):

DAMP1354_2.merged.rdgrp.bam Loading required package: GenomicRanges Loading required package: stats4 Loading required package: BiocGenerics Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges Loading required package: GenomeInfoDb 'BiocParallel' did not register default BiocParallelParams: invalid class “MulticoreParam” object: 1: ‘cluster’, ‘.clusterargs’, ‘.uid’, ‘RNGseed’ must be length 1 invalid class “MulticoreParam” object: 2: ‘.clusterargs’, ‘.controlled’, ‘logdir’, ‘resultdir’ must be length 1 Warning message: In is.na(x[[i]]) : is.na() applied to non-(list or vector) of type 'environment' There were 50 or more warnings (use warnings() to see the first 50)

Should I trust the output files or something is needed to fix the warning first? Thanks in advance!

Best, Jia-Xing

On Thu, Jul 6, 2017 at 6:19 PM, Jia-Xing Yue yuejiaxing@gmail.com wrote:

Hi Valentina,

Thank you very much! It works!

Best, Jia-Xing

On Thu, Jul 6, 2017 at 6:01 PM, Valentina Boeva notifications@github.com wrote:

The problem was that the line of mateOrientation contains a comment starting with "#". So FREEC assumed that the orientation was "unknown". So it discarded all reads. The solution was to use: mateOrientation=0

This bug will be fixed in v10.10

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BoevaLab/FREEC/issues/2#issuecomment-313440667, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq19rRq0SRXmy34tTjD4UXPbXzh_xiZks5sLQTTgaJpZM4Jeo6g .

-- Jia-Xing Yue

Population Genomics and Complex Traits Group Tour Pasteur 8eme etage Faculté de Médecine Institute for Research on Cancer and Aging, Nice (IRCAN) CNRS UMR 7284 - INSERM U 1081 - Université Côte d’Azur (UCA) 28 Avenue de Valombrose 06107 NICE Cedex 2 France

Personal website: http://www.iamphioxus.org/

-- Jia-Xing Yue

Population Genomics and Complex Traits Group Tour Pasteur 8eme etage Faculté de Médecine Institute for Research on Cancer and Aging, Nice (IRCAN) CNRS UMR 7284 - INSERM U 1081 - Université Côte d’Azur (UCA) 28 Avenue de Valombrose 06107 NICE Cedex 2 France

Personal website: http://www.iamphioxus.org/

valeu commented 7 years ago

do you get an output? :)

yjx1217 commented 7 years ago

Yes, the output looks normal to me. The R version that I am using is "R version 3.4.0 (2017-04-21)".

Best, Jia-Xing

On Fri, Jul 7, 2017 at 5:21 PM, Valentina Boeva notifications@github.com wrote:

do you get an output? :)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BoevaLab/FREEC/issues/2#issuecomment-313712325, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq19tdWU5kSMvpHXq8BqrLrkpRnK7Lzks5sLk0FgaJpZM4Jeo6g .

-- Jia-Xing Yue

Population Genomics and Complex Traits Group Tour Pasteur 8eme etage Faculté de Médecine Institute for Research on Cancer and Aging, Nice (IRCAN) CNRS UMR 7284 - INSERM U 1081 - Université Côte d’Azur (UCA) 28 Avenue de Valombrose 06107 NICE Cedex 2 France

Personal website: http://www.iamphioxus.org/

valeu commented 7 years ago

Then, it should be OK.