Clinical-Genomics-Lund / nextflow_wgs

5 stars 5 forks source link

Custom intervals in update_bed.pl #210

Open Jakob37 opened 2 months ago

Jakob37 commented 2 months ago

Description and reviewer info

The purpose here is to generalize update_bed.pl such that we easily can add in one or several custom bed files.

It seems we already have one hard-coded "agilient_hg38_nochr_noalt_1-3.bed". I see no reason for this not being "just another custom bed".

The idea is to run the script as such:

update_bed.pl \
  --old old_clinvar.vcf \
  --new new_clinvar.vcf \
  --build 108 \
  --clinvardate 20240701 \
  --incl_bed agilent.bed \
  --incl_bed custom.bed

Then we could decide whether we would want to bunch several bed together with all our added regions, or keep different bed for different special regions separate for clarity.

I'll test this one a bit more to see that the numbers align with what is seen in the version in current master.

Type of change

Checklist

Further fixes before merge:

Documentation

Patch

Major / Minor change

Test/review documentation

Review performed by

(Add if missing)

Testing performed by

Jakob37 commented 2 months ago

OK, some testing of the old and the new.

Using the master branch (2>/dev/null is needed to omit a flood of warnings). Note the ">" line in which I print the total size of the ClinVar hashes. The new one is identical to "clinvar in common between versions".

$ perl update_bed.pl --old testdata/clinvar38_20231230.vcf --new testdata/clinvar38_20240624.vcf --build 108 --clinvardate PLACEHOLDER 2>/dev/null
> In compare_clinvar new: 270344 old: 249890
clinvar in common between versions :270344
added new(unique targets)          :38735(1274)
removed old(unique targets)        :18281(38)

Using the PR-branch:

$ perl update_bed.pl --old testdata/clinvar38_20231230.vcf --new testdata/clinvar38_20240624.vcf --build 108 --clinvardate PLACEAHOLDER --incl_bed bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed --skip_download
Additional included BED files: bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
ClinVar in common between versions :231609
Added new(unique targets)          :38735(1274)
Removed old(unique targets)        :18281(38)
Jakob37 commented 2 months ago

One thing to discuss. In the previous output, the bed file looked like this:

1       12080   12251   AGILENT-EXOME
1       12595   12802   AGILENT-EXOME
1       13163   13658   AGILENT-EXOME
1       14347   14520   AGILENT-EXOME
1       14620   15015   AGILENT-EXOME
1       15795   15914   AGILENT-EXOME

Here, after update, I just wrote the path to the used bed file:

1       12080   12251   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       12595   12802   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       13163   13658   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       14347   14520   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       14620   15015   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       15795   15914   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       16743   17098   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed

We could add the option to provide tags to fix this. Or just use the base filename. Anyway, wanted to discuss before I start poking around further in this.

There is also some inelegance going on

1       925839  926194  AGILENT-EXOME,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108
1       930135  930505  EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,AGILENT-EXOME,EXONS-108

I suggest this is reduced either to "AGILENT-EXOME,EXONS-108(8)" or just "AGILENT-EXOME,EXONS-108"

ViktorHy commented 2 months ago

One thing to discuss. In the previous output, the bed file looked like this:

1       12080   12251   AGILENT-EXOME
1       12595   12802   AGILENT-EXOME
1       13163   13658   AGILENT-EXOME
1       14347   14520   AGILENT-EXOME
1       14620   15015   AGILENT-EXOME
1       15795   15914   AGILENT-EXOME

Here, after update, I just wrote the path to the used bed file:

1       12080   12251   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       12595   12802   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       13163   13658   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       14347   14520   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       14620   15015   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       15795   15914   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed
1       16743   17098   bin/reference_tools/agilient_hg38_nochr_noalt_1-3.bed

We could add the option to provide tags to fix this. Or just use the base filename. Anyway, wanted to discuss before I start poking around further in this.

There is also some inelegance going on

1       925839  926194  AGILENT-EXOME,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108
1       930135  930505  EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,EXONS-108,AGILENT-EXOME,EXONS-108

I suggest this is reduced either to "AGILENT-EXOME,EXONS-108(8)" or just "AGILENT-EXOME,EXONS-108"

I think the reason I liked the AGILENT-EXOME is that is much more readable if you load the bed-file into some UI like IGV. But the gain is very small and is outweighed by the traceability of giving the path to the file. The second point about reducing the exon output I'm all for it. I was just lazy when I did this script originally

Jakob37 commented 1 month ago

Increased user friendliness. Some variables changed names to be a bit more descriptive. I dont see anything major that would break the function of update_bed.pl

I thought you said GPT dissed the script. Looks good to merge.

OK, thanks! I'll proceed with merging.

I asked ChatGPT again post-changes. Now it seems to think it is "fine with room for improvement" 👍

Sure, here's a more concise critique of your Perl script:

Strengths:

    Flexible Input Handling: Efficiently manages command line options.
    Modular Design: Organizes code into functions, aiding maintenance.
    Checks for Required Parameters: Prevents execution without necessary inputs.

Areas for Improvement:

    Hardcoded Values: Parameters like genomic padding are fixed; consider making these configurable.
    Clean Up Code: Remove unused code and improve commenting for clarity.
    Robust Error Handling: Enhance error management throughout, particularly around file and external command operations.
    Dependencies: The script relies on external tools like wget and bedtools; ensure these are available or handle scenarios where they're not.
    Documentation: Sparse in key areas; add detailed comments explaining complex logic.

Overall: The script is functional but would benefit from greater flexibility, better error handling, and improved documentation for maintainability and usability.