Genomic Region Preselection files for hg19

Kange2014 commented 10 months ago

Hi, It's nice to see your work GCparagon to perform GC bias correction at fragment level. However, the required genomic region preselection files are only provided for hg38. This is not applicable to many cases that use hg19 reference genome.

So I tried my best to generate ones for hg19. But I found it's difficult to use the corresponding scripts to do such a work. Could u help generate hg19 ones kindly?

Thanks.

BGSpiegl commented 10 months ago

Hi Lucius. I am sorry to hear that you have troubles with the code. I will have a look at the scripts and try to generate hg19 pre-selected regions for you. Do the ENCODE blacklisted regions suffice for your purposes? BR Benjamin

Kange2014 commented 9 months ago

Hi Benjamin, Thanks for your prompt reply and kind help. If possible, I'd also like to add UCSC hg19 gap track into this encode blacklisted regions https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg19-blacklist.v2.bed.gz as our excluded regions

Kange2014 commented 9 months ago

Any updates?

BGSpiegl commented 9 months ago

I went through the region preselection scripts and found the logic flaw. One required part is located in the validation code of the including_EGAS00001006963_results branch. I added a new script taking this code and went through the region preselection script-by-script. The hg19 reference files should be ready tomorrow. The current code is in the dev-hg19ExclusionList branch which I will merge into main once everything is ready. In case of any chromosome or alternate contig naming issues between the hg19.2bit reference and your samples, following the preselection process yourself using the upcoming v0.6.3 should work like a charm.

Kange2014 commented 9 months ago

Many thanks. Expect to see the hg19 genomic region preselection files.

BGSpiegl commented 9 months ago

The hg19 reference files have been created and are available now through the linked dev-hg19ExclusionList branch.

Please test these files together with the hg19.2bit reference file that can be downloaded from UCSC using the code in the comments of the EXECUTE_reference_download.sh script.

The genomic interval preselection scripts should work now as intended. The README.md file has been updated to describe to the user who wishes to use a non-hg38/hg19 reference which files have to be created for which experiments. Please also read the comments in the GI-preselection scripts inside the hg19 preselection directory for more detailed warnings.

I will merge the branch into main and close this issue if no further problems arise from using the hg19 reference files the next days.

Kange2014 commented 9 months ago

Really appreciate your help. If my understanding is not wrong, the preselection file for hg19 is hg19_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed. However, only 500+ 1Mb intervals are selected for downstream GC bias correcting. Do you think this number is reasonable?

Thanks.

Kange2014 commented 9 months ago

In addition, hg19_reference_GC_content_distribution.tsv file should use relative values in the 2nd column.

BGSpiegl commented 9 months ago

Hi Lucius.

Concerning the hg19 pre-selected regions: The combined hg19 exclusion regions (only non-chrY, non-chrM standard chromosomes counted) contained 486,086,410 bp which corresponds to 16.01% of hg19. In comparison, the hg38 exclusion regions contain 361,715,043 bp which corresponds to 11.9% of hg38. So my gut tells me that the number of "pre-selectable" intervals should be lower than for hg38 (1,699) but you are right in that 500 Mb of the reference as a result of pre-selection is too small! Actually, there was a bug caused by a partially wrong BedTools flag being used in 01-GI-preselection_test_Mbp_intervals_against_ExclusionList_hg19.py which reduced the genomic intervals available for pre-selection to those that show any overlap with the exclusion marked regions. This caused "good intervals" showing no overlap to be dropped from the BedTools output.

I want to thank you for poking me on this issue! Finding that buggy flag would've probably taken much longer without you. Now the number of available intervals is more reasonable: 2,187 1Mbp regions (= 72% of hg19) For other users using hg38 reading this: the hg38 pre-selection will be updated soon!

In general, parameters that can be tweaked during the genomic region pre-selection process to get a more lenient region selection are:

increasing the REGION_OVERLAP_PERCENTAGE_THRESHOLD in 02-GI-preselection_select_low_score_regions_from_overlapping_hg19.py,
use more fine-grained shifting of genomic intervals (like 16x instead of 4x) and/or
create a less restrictive exclusion marked regions list. Or any combination of the above.

Concerning your initial question of how reasonable a number of ~500 pre-selected regions would be: For the validation samples B01, C01, H01, and P01, the v0.6.0 code selected the following number of pre-selected regions in my local benchmark:

B01: 681
C01: 592
H01: 536
P01: 436

In general, you should be able to work with 500 regions (unless you are correcting samples with average DoC way below 5x which will use more than 700 intervals). But this problem has been solved now.

Concerning the second column of hg19_reference_GC_content_distribution.tsv, please note that the GCparagon code of the hg19 branch got updated! You need to clone and use the new branch with the hg19 files! DO NOT use the older version from the main branch with the hg19 files! I will merge the hg19 branch once the new hg38 pre-selected regions are available.

I updated the branched code to always transform these values to relative frequencies in the read_gc_distribution() function:

# make sure these sum up to 1.0
gc_dist_sum = sum(gc_dist)
if not (0.9999 < gc_dist_sum < 1.0001):
    gc_dist = [rf/gc_dist_sum for rf in gc_dist]  # force elements of distribution to sum up to 1.0

Therefore, relative or absolute values can be used in the 2nd column for the reference FGCD in hg19_reference_GC_content_distribution.tsv.

BGSpiegl commented 9 months ago

A note: the hg38 reference was not affected by erroneous BedTools flag usage. I changed the hg38 exclusion marked regions to include also GRC open incidents and removed tandem repeat finder and non-uniquely mappable regions which increased the number of available pre-selected regions from ~1.7 k to ~2.7k.

BGSpiegl / GCparagon

Genomic Region Preselection files for hg19 #4