BoevaLab / FREEC

Control-FREEC: Copy number and genotype annotation in whole genome and whole exome sequencing data
153 stars 49 forks source link

Apparently only one BAM file allowed as input; but Config file parser doesn't warn user if they add more. #88

Open rsharris opened 3 years ago

rsharris commented 3 years ago

I'd previously been using a pileup file as input, created before running freec. But for one of my genomes samtools mpileup failed, and the freec author advised me to try using BAM as input to freec (instead of pileup).

I have 34 BAM files. I couldn't find anything in the documentation that tells me whether I should list those as 34 instances of mateFile in my config file, or whether I will have to merge those into a single BAM file. (spoiler alert: you have to merge them).

So I tried creating a fake genome, simulating three sets of reads from that, aligning them to create sorted BAMS, and listing those as three instances of mateFile in my config file. Sadly, I ran into the mysterious "zero reads in windows with the GC-content ... try to rerun the program with higher number of reads" error. So I was unable to directly answer my question.

Failing that, I then went digging in the source code. I'm 99.9999% certain that Config.cpp only supports one value for any key. So apparently the use of only one BAM file is supported. Config.cpp doesn't catch cases where the user violates this assumption. So even if I did have a successful run where I listed all 34 BAM files, I'd really have to dig hard to realize that the program ignored 33 of them.

valeu commented 3 years ago

Dear Bob, sorry that FREEC caused you so many troubles. Indeed, it accepts only one BAM. But I will add it to the TODO list to make it accept more. Merging BAMs is not an ideal solution as GC-content bias is likely to vary across sequencing batches.

rsharris commented 3 years ago

Thanks.

Wouldn't samtools mpileup have the same issue regarding batch-dependent GC bias?

valeu commented 3 years ago

yes, it would. What I need to do is ask for separate files and independently normalize these read counts for the GC-content bias, then average the normalized values... You can propose a better idea, I guess.

rsharris commented 3 years ago

Looking in more detail at the pileup format, and the pileup files that samtools mpileup is creating for me, I'm not sure pileup would have a problem. But I am seeing conflicting information from those sources.

According to the description at www.htslib.org/doc/samtools-mpileup.html , samtools mpileup keeps the info from each BAM file separate ("remaining columns show the pileup data, and are repeated for each input BAM file specified"). If this is true, then freec would be able to treat (the information that originated at) each BAM file separately.

But I can't tell if that's true. en.wikipedia.org/wiki/Pileup_format makes no mention of separate per-BAM column groups. samtools.sourceforge.net/pileup.shtml agrees with wikipedia.

The pileup files mpileup is producing for me do have those per-BAM column groups.

I wonder how freec is parsing the pileup file. In those cases where I got it to run, I wonder if it was silently ignoring all but the first group of columns in my pileup file? I.e. doing the same thing it would do if you give it more than one BAM file.

rsharris commented 3 years ago

I have verified that freec ignores all but the first per-BAM column group when reading a pileup file.

Verified both experimentally on a small simulated data set. And by looking at the code, which reads only columns 1, 2, and 5 of the pileup file. :(

I guess I am missing something. Every sequencing project I've been involved with, over the past 16 years, has had more than one sequencing run. Yet it seems freec presumes a single sequencing run. What am I missing?

valeu commented 3 years ago

FREEC was first developed for low coverage data, so historically it works on one run only. THe user can also merge BAM files together and run FREEC on this merged BAM (which is not ideal, as I explained earlier.) So an update in the code is needed.

rsharris commented 3 years ago

Cool, thanks! I am in the process of trying merged BAM for my project.