bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
984 stars 353 forks source link

Replace deprecated remove_lcr option with exclude_regions #3298

Closed hackdna closed 2 years ago

hackdna commented 4 years ago

remove_lcr was deprecated in 2018: https://github.com/bcbio/bcbio-nextgen/commit/17d8bb0623888a55a0c1b7ba52f1dd0f9d4dc5b4 (see also: https://github.com/bcbio/bcbio-nextgen/issues/3267#issuecomment-655932497)

All instances of remove_lcr should be replaced with exclude_regions in the documentation, as well as configuration files (https://github.com/bcbio/bcbio-nextgen/search?q=remove_lcr&unscoped_q=remove_lcr). This also requires adding a corresponding lcr option if not already included.

amizeranschi commented 2 years ago

Hi @naumenko-sa

How does the lcr option work? Where are these regions defined, with respect to the reference genome that's being used? And is there such a list of LCRs for sacCer3? I have checked the file sacCer3-resources.yaml, but there doesn't seem to be any lcr option defined in there.

After some online searching I found mention of a list of blacklist regions in sacCer3 (see slide no. 12 in the link below), which were relevant to a ChIP-seq analysis. These seem to be LCRs. I wonder if there's more.

https://biohpc.cornell.edu/doc/Epigenomics_2020_Lecture2.pdf

naumenko-sa commented 2 years ago

Hi @amizeranschi !

The LCR option is described in the docs in detail - search for lcr: https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#algorithm-parameters

There are many filtration strategies in different pipelines. The exclude_regions (a former remove_lcr: true) is for small variants calling (variant2 pipeline) There are similar filtration strategies in

Sergey

amizeranschi commented 2 years ago

Thanks for clarifying. It would be useful if the exclude_regions option also accepted a BED file directly, as well, similar to how it works with variant_regions, but with exclude_regions, the provided BED file would effectively get subtracted from the reference genome.

naumenko-sa commented 2 years ago

Thank you for the suggestion!

Unfortunately, this improvement does not pass the activation threshold: there is a lot of logic related to bed files in bcbio in many use cases: callable regions, variant calling, SV calling, WGS, WES, panels, amplicons, QC/fastqc/multiqc/mosdepth/qualimap, coverage reporting, remove extracontigs. You may see how complex it gets https://github.com/bcbio/bcbio-nextgen/issues/3609. For example, if we remove regions, do we remove them from the coverage analysis (we should not), or just from variant calling (then we create a discrepancy). So by adding this we risk to dive into a big debug session with a little gain in the end. Thus, I would counter-suggest, if I may, just to use bedtools subtract -a original.bed -b to_remove.bed > production.bed and use the production.bed in variant_regions and coverage for that use case. In a similar way, some users expand their bed files to capture more variants for somatic analyses and purecn CNA.

amizeranschi commented 2 years ago

Hah, fair enough, you make a good point. FYI, I already have your suggestions implemented myself, but I figured having them implemented in bcbio somehow could make things easier for other people:

## Download annotation files with repetitive regions from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/${bcbio_genome}/database/):
## microsatellite file:
wget http://hgdownload.soe.ucsc.edu/goldenPath/${bcbio_genome}/database/microsat.txt.gz
gunzip -f microsat.txt.gz
## simple repeats file:
wget http://hgdownload.soe.ucsc.edu/goldenPath/${bcbio_genome}/database/simpleRepeat.txt.gz
gunzip -f simpleRepeat.txt.gz
## Extract relevant columns from these files into BED files and concatenate them
cut -f2-4 microsat.txt > microsat.bed
cut -f2-4 simpleRepeat.txt > simpleRepeat.bed
## Concatenate the files, sort and merge the files
cat *.bed | sort-bed - | bedops --merge - > lcr-${bcbio_genome}.bed
## Remove unneeded files
rm -f microsat.txt microsat.bed simpleRepeat.txt simpleRepeat.bed
## move the file with lcr into the bcbio genome directory
mv lcr-${bcbio_genome}.bed ${bcbio_path:?}/genomes/${bcbio_species}/${bcbio_genome}/seq/lcr-sorted-uniq.bed
## Subtract the repetitive sequences from the variant_regions.bed file
bedops --difference variant_regions.bed ${bcbio_path:?}/genomes/${bcbio_species}/${bcbio_genome}/seq/lcr-${bcbio_genome}.bed > variant_regions.bed
amizeranschi commented 2 years ago

@naumenko-sa One more question on this topic and I'd rather write it here instead of opening a separate thread. Regarding what you wrote earlier:

There are similar filtration strategies in

* chip-seq -  this is close to your original question
  https://github.com/roryk/chipseq-greylist
  https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/chipseq/peaks.py#L127

How does chipseq greylisting work and in which situations should it run successfully? I see the following messages in my runs:

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/chipseq/peaks.py#L150

What is causing them and how could I get greylisting to work with my data?