bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Error: malformed BED entry #20

Closed ziphra closed 8 months ago

ziphra commented 1 year ago

Hi,

I tried running straglr like this to detect TR expansions in ONT whole genome sequencing:

python straglr.py mmi.bam ref.fa straglr_scan --min_str_len 2 --max_str_len 100 --min_ins_size 100 --genotype_in_size --min_support 2 --max_num_clusters 2 --nprocs 10 --exclude hg38.exclude.bed

and I generated hg38.exclude.bed like this:

(cut -f1-3 hg38.segdups.bed;awk '$3-$2>=10000' hg38.simple_repeats.bed | cut -f1-3;cat hg38.centromeres.bed hg38_gaps.bed) | awk -v OFS='\t' '{print $1, $2, $3}' | bedtools sort -i - | bedtools merge -i - -d 1000 > hg38.exclude.bed

But while running straglr I get the following warnings:

***** WARNING: File /tmp/tmpmof4bq6q has inconsistent naming convention for record:
GL000008.2  0   209709

***** WARNING: File /tmp/tmpmof4bq6q has inconsistent naming convention for record:
GL000008.2  0   209709

***** WARNING: File /tmp/tmpx0ihvtut has inconsistent naming convention for record:
KI270915.1  7002    7003    2d8640dc-1a9a-43d3-98df-be39a98ef816_851_196

***** WARNING: File /tmp/tmpx0ihvtut has inconsistent naming convention for record:
KI270915.1  7002    7003    2d8640dc-1a9a-43d3-98df-be39a98ef816_851_196

And then this error:

pybedtools.helpers.BEDToolsError: 
Command was:

    bedtools sort -i /tmp/pybedtools.fxeierw9.tmp

Error message was:
Error: malformed BED entry at line 787. Start Coordinate detected that is < 0. Exiting.

Should I remove all patch alignments from my BAM?

I would love some insights on that.

Many thanks in advance

ziphra

readmanchiu commented 1 year ago

The chromosome names in the "exclude" BED file should follow the same naming convention as those in the alignment BAM file. The error may arise from that mismatch. The second error:

Error: malformed BED entry at line 787. Start Coordinate detected that is < 0. Exiting.

is also related to the patch chromosomes. So for now, the best solution is to remove the patch alignments, as you suggested.
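A quick way to see which names disagree before running Straglr is to compare the first column of the exclude BED with the @SQ names in the BAM header. The following is a minimal sketch, not part of Straglr: it assumes the header has been dumped to text (for example with `samtools view -H mmi.bam`), and the function names `sq_names`, `bed_names` and `missing_from_bam` are hypothetical. The sample header and BED content are illustrative only.

```python
def sq_names(sam_header_text):
    """Collect sequence names from the @SQ lines of a SAM/BAM header dump."""
    names = set()
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def bed_names(bed_text):
    """Collect the chromosome names (first column) used in a BED file."""
    names = set()
    for line in bed_text.splitlines():
        # Skip blank lines and the optional BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split()[0])
    return names


def missing_from_bam(bed_text, sam_header_text):
    """Return BED chromosome names that do not appear in the BAM header."""
    return sorted(bed_names(bed_text) - sq_names(sam_header_text))


if __name__ == "__main__":
    # Illustrative inputs: a UCSC-style name in the BED, an accession-style
    # name in the BAM header.
    header = "@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:GL000008.2\tLN:209709\n"
    bed = "chr1\t0\t10\nchrUn_GL000008v2\t0\t209709\n"
    for name in missing_from_bam(bed, header):
        print("not in BAM header:", name)
```

Any name this reports is a candidate for either renaming in the BED or dropping from it.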

Thanks for trying Straglr. Please let me know if there are any other issues.

ziphra commented 1 year ago

Thank you for your prompt response.

Instead of removing the patch alignments, I renamed the patch contigs in the exclude BED so that they match the BAM file. For anyone with the same problem who wants to keep their patch alignments: the names were written like "chrY_MU273398v1_fix" in my exclude BED but like "MU273398.1" in my BAM, so I ran:

sed 's/^[^_]*_//'  hg38.exclude.bed | awk 'BEGIN{FS=OFS="\t"} {sub(/[_].*/,"",$1)} 1' | sed 's/v1/\.1/' | sed 's/v2/\.2/' >  hg38_renamed.exclude.bed
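
The same renaming can also be sketched in Python with a single regular expression, which covers version suffixes beyond v1 and v2 and only touches the first column. This is an illustrative alternative to the pipeline above, not something shipped with Straglr. The names `ucsc_to_accession` and `rename_bed_line` are hypothetical, and the pattern assumes that alt, patch and unplaced contigs follow the `chrN_ACCESSIONvV[_suffix]` form seen in this thread.

```python
import re

# Assumed name layout: "chr<label>_<accession>v<version>[_<suffix>]",
# e.g. "chrY_MU273398v1_fix" or "chrUn_GL000008v2".
_UCSC_ALT = re.compile(r"^chr[^_]+_([A-Za-z]+\d+)v(\d+)(?:_.*)?$")


def ucsc_to_accession(name):
    """Turn a UCSC-style alt/patch name into 'ACCESSION.VERSION'.

    Names that do not match the assumed layout (e.g. 'chr1') are returned
    unchanged.
    """
    match = _UCSC_ALT.match(name)
    if match is None:
        return name
    return f"{match.group(1)}.{match.group(2)}"


def rename_bed_line(line):
    """Rewrite only the chromosome column of one tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = ucsc_to_accession(fields[0])
    return "\t".join(fields)


if __name__ == "__main__":
    for example in ("chrY_MU273398v1_fix", "chrUn_GL000008v2", "chr1"):
        print(example, "->", ucsc_to_accession(example))
```

Applying `rename_bed_line` to every line of hg38.exclude.bed and writing the results to a new file would give an equivalent of hg38_renamed.exclude.bed, under the naming assumption stated above.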

readmanchiu commented 1 year ago

A new version has been released that should also handle ALT chromosome alignments. However, the chromosome names in the BAM file and in the BED file passed to --exclude still have to agree.