bedtools getFASTA is not converting a whole BED file

Hello,

I have tried using bedtools getFASTA on large BED12 files, but I have noticed that the function does not complete and fails when it hits a line that creates this error:

Error: cannot construct subsequence with negative offset or length < 1

My interpretation of this error message (and from others online) is that when the start coordinates are greater than the end coordinates, bedtools getFASTA will stop the conversion. To find the lines in the BED12 file that were causing this error, I created the following script to convert each line in the BED12 file line-by-line into a FASTA file:

# BED files
bed_files=($(ls *.bed))

for bed_file in "${bed_files[@]}"; do
    line_number=1
    while IFS=$'\t' read -r chrom start end name score strand thickStart thickEnd itemRgb blockCount blockSizes blockStarts; do
        # Check if start < end (valid coordinates)
        if [ "$start" -ge "$end" ]; then
            echo "Error: Invalid coordinates at line $line_number:"
            echo "Chrom: $chrom, Start: $start, End: $end"
            continue
        fi

        # Check if the coordinates are valid
        if [ "$start" -ge 0 ] && [ "$end" -ge 0 ]; then
            echo "Processing line $line_number..."
            chr=${bed_file##*_}
            chr=${chr%.bed}
            temp_bed="$BED_tempDir/D60_brain_organoid_${chr}_line${line_number}.bed"
            echo -e "$chrom\t$start\t$end\t$name\t$score\t$strand\t$thickStart\t$thickEnd\t$itemRgb\t$blockCount\t$blockSizes\t$blockStarts" > "$temp_bed"

            # Run bedtools getfasta for each temp_bed12 and handle errors
            if bedtools getfasta -s -split -name -fi "$genome" -bed "$temp_bed" -fo "$FASTA_tempDir/D60_brain_organoid_${chr}_line${line_number}.fa" 2>&1; then
                echo "D60_brain_organoid_${chr}_line${line_number}.fa successfully created"
            else
                echo "Error: Couldn't create D60_brain_organoid_${chr}_line${line_number}.fa"
            fi
        else
            echo "Error: Invalid coordinates at line $line_number."
        fi

        ((line_number++)) 
    done < "$bed_file"
done

From this script, I got the following 3 lines out of the total 2875674 that caused this error:

Error: cannot construct subsequence with negative offset or length < 1
Error: Couldn't create D60_brain_organoid_chr12_line188216.fa
Error: cannot construct subsequence with negative offset or length < 1
Error: Couldn't create D60_brain_organoid_chr14_line31475.fa
Error: cannot construct subsequence with negative offset or length < 1
Error: Couldn't create D60_brain_organoid_chr16_line108490.fa

Using awk, I searched for these lines in the original BED12 file and got these lines where the start coordinates do seem to be less than the end coordinates:

# Line 188216
chr12   111894105   111903903   m64403e_230116_171857/105318615/ccs 18  +   111894105   111903903   255,0,0 3   132,0,286   0,1620,9512

# Line 31475
chr14   52699325    52716208    m64403e_230116_171857/90505969/ccs  60  +   52699325    52716208    255,0,0 3   78,0,381    0,404,16502

# Line 108490
chr16   74866465    74896106    m64403e_230116_171857/140837374/ccs 60  -   74866465    74896106    255,0,0 3   417,0,17    0,2512,29624

Since these coordinates look valid, I was wondering why these lines are causing errors for getFASTA? Also, would it be possible to input some new code to handle this error so that if getFASTA runs into a line that creates this error, it doesn't convert that line, but continues to convert the rest of the file? The line-by-line getFASTA script that I created above took multiple hours to run, so it would be very convenient to have this new error handling since only a few lines out of the almost 3 million lines were unable to convert.

Thank you for your help in advance!

Best regards, Mia

Hi Mia,

Thanks so much for reporting this. Would it be possible to share the FASTA files and the command line invocation for each of these problematic lines? It looks like they each use a different input FASTA?

All the best, Aaron

I see those are the output file names now. Could you share your input genome file?

Hi Aaron,

Thank you for your quick response! For my FASTA file, I am using the hg38 genome without contigs.

Here is the full script for reference:

#!/bin/bash
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH -n 24
#SBATCH --job-name=bed2fasta
#SBATCH -o bed2fasta_%j.log
#SBATCH -e bed2fasta_%j.log
#SBATCH --mail-type=ALL,TIME_LIMIT_80,TIME_LIMIT
#SBATCH -t 48:00:00
#SBATCH --mem=10gb

### Script: bed2fasta.sh
### Usage: sbatch bed2fasta.sh
### Purpose: Convert BED files to FASTA format for each chromosome.
### Created: 3/4/2024

# Modules
module purge all
module load bedtools/2.30.0

# BED file directory
inputDir="/projects/b1073/RNAseq/InHouse_LR/IsoSeq_Pipeline/organoid/D60_brain_organoid/TAMA_ORF_NMD_pipeline/split_BEDs"
cd "$inputDir"

# Output directory
outputDir="/projects/b1073/RNAseq/InHouse_LR/IsoSeq_Pipeline/organoid/D60_brain_organoid/TAMA_ORF_NMD_pipeline/temp_files"
[ ! -d "$outputDir" ] && mkdir -p "$outputDir"
# Directory for temporary BED files
BED_tempDir="${outputDir}/temp_beds"
[ ! -d "$BED_tempDir" ] && mkdir -p "$BED_tempDir"
# Directory for temporary FASTA files
FASTA_tempDir="${outputDir}/temp_fasta"
[ ! -d "$FASTA_tempDir" ] && mkdir -p "$FASTA_tempDir"

# hg38 genome
genome="/projects/b1073/pipelines/commonref/GRCh38/GRCh38_NoContigs.primary_assembly.genome.fa"

# BED files
bed_files=($(ls *.bed))

for bed_file in "${bed_files[@]}"; do
    line_number=1
    while IFS=$'\t' read -r chrom start end name score strand thickStart thickEnd itemRgb blockCount blockSizes blockStarts; do
        # Check if start < end (valid coordinates)
        if [ "$start" -ge "$end" ]; then
            echo "Error: Invalid coordinates at line $line_number:"
            echo "Chrom: $chrom, Start: $start, End: $end"
            continue
        fi

        # Check if the coordinates are valid
        if [ "$start" -ge 0 ] && [ "$end" -ge 0 ]; then
            echo "Processing line $line_number..."
            chr=${bed_file##*_}
            chr=${chr%.bed}
            temp_bed="$BED_tempDir/D60_brain_organoid_${chr}_line${line_number}.bed"
            echo -e "$chrom\t$start\t$end\t$name\t$score\t$strand\t$thickStart\t$thickEnd\t$itemRgb\t$blockCount\t$blockSizes\t$blockStarts" > "$temp_bed"

            # Run bedtools getfasta for each temp_bed12 and handle errors
            if bedtools getfasta -s -split -name -fi "$genome" -bed "$temp_bed" -fo "$FASTA_tempDir/D60_brain_organoid_${chr}_line${line_number}.fa" 2>&1; then
                echo "D60_brain_organoid_${chr}_line${line_number}.fa successfully created"
            else
                echo "Error: Couldn't create D60_brain_organoid_${chr}_line${line_number}.fa"
            fi
        else
            echo "Error: Invalid coordinates at line $line_number: Start or end coordinate is negative."
        fi

        ((line_number++)) 
    done < "$bed_file"
done

For some more context, I initially convert a long-read RNA-seq BAM file into BED12 files using bedtools bam2bed and each of the BED12 files is split by chromosome, resulting in 24 separate BED12 files. For each of the BED12 files, I iterate over them and for each line of the BED12 files, I create a temporary BED12 file of that single line and check if the coordinates of that line are valid or not, and if they are, it'll convert that BED12 file into a FASTA file with that single line using getFASTA. I did this originally thinking I could concatenate all the temporary BED12 and temporary FASTA files into single files, but that took forever, which led me to do the following: once I find the lines that create the error I first mentioned, I get rid of them using awk, concatenate all the BED12 files into a single BED12 file, and am then able to use getFASTA on the one BED12 file with no issue.

The hg38 FASTA file is too large to upload here, but I'm happy to send it to you in another way if you would still like it.

Thank you, Mia

Hi Mia, the issue is that the 11th column (block sizes: https://genome.ucsc.edu/FAQ/FAQformat.html#format1) in each of these 3 records has a length of 0 as the second length (e.g., "132,0,286"). Bedtools rejects this because a block such as an exon should not have zero length if it is a block.

If I change that 0 to something non-zero (e.g., "132,1,286"), it works:

cat mia.2.bed
chr12   111894105   111903903   m64403e_230116_171857/105318615/ccs 18  +   111894105   111903903   255,0,0 3   132,1,286   0,1620,9512

bedtools getfasta -s -split -name -fi chr12.fa -bed mia.2.bed
>m64403e_230116_171857/105318615/ccs::chr12:111894105-111903903(+)
gatctcggctcactgcaacctctgtctcccaggttcaagtgattctcctgcctcagcctcccaagtagttgggattacaggcacccgccaccatgcccggctaatttttgtatttttagtagagacagggtttgttgcagtgagctatgataccaccacagtactctagcctgggcgacagagcaagaccctgtctcagataaataaaaaCcattcattctgttctaggcactgtgttaaagcaatactttacattcattgccttaatttaatcctcccaacttcccaataaaacctccatcttgtacatgaggaaacagagactgaagagtttgctctagagctagctagtaagtggtgggacctgcatccctccccagaaatgagtgaatctactgcctcatgaaccttttccTTCC

Is it possible that this use of a zero-length block is a bug in some upstream software?

Also, the main bedtools repository is (confusingly) here (bedtools2): https://github.com/arq5x/bedtools2. In the future, please file issues there. Thanks!

Hi Aaron,

I didn't notice that the exon was 0 length in the 11th column; thank you so much for finding that error! I used bedtools bamtobed to convert my BAM files into BED12 format. Here is the part of my script where I used it:

# Modules
module purge all
module load bedtools/2.30.0
module load samtools/1.14

# Chromosome array
chrom=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY")

# Split BAM by chromosome and create BED12 files
for chr in "${chrom[@]}"
do
    echo "Splitting bam into ${chr}.bam..."
    bam_file_name=$(basename ${bam} .sort.bam)
    split_bam="${BAM_outputDir}/${bam_file_name}_${chr}.bam"
    split_bam_sorted="$BAM_outputDir/${bam_file_name}_${chr}.sort.bam"
    samtools view -@ 24 -h -b "$bam" "$chr" > "$split_bam"
    samtools sort -@ 24 "$split_bam" -o "$split_bam_sorted"
    samtools index "$split_bam_sorted"
    echo "Creating BED12 file for ${split_bam_sorted}..."
    output_bam="${BED_outputDir}/D60_brain_organoid_${chr}.bed"
    bedtools bamtobed -i "$split_bam_sorted" -bed12 -split -tag > "$output_bam"
done

The BAM file that is split by chromosome is from PacBio IsoSeq data, and the BAM is preprocessed to the point of full-length non-concatemer (FLNC) reads. I've also ran bedtools bamtobed on a different BAM file and then filtered the BED12 file for only reads that supported GENCODE annotated transcripts, and I also ran into the same issue afterwards with bedtools getFASTA.

If you prefer, I can continue this conversation (aka copy and paste this issue) in the github repository you linked if that is better; otherwise, I will post issues in the future there.

Thank you for all your help!

Best regards, Mia

Hmm, I wonder if there is a bug in bamtobed for this. Could you send me the head of the SAM file, as well as the SAM record for one of the problematic alignments and I will have a look?

Hi Aaron,

Here is the header of the SAM file:

@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10    LN:133797422
@SQ SN:chr11    LN:135086622
@SQ SN:chr12    LN:133275309
@SQ SN:chr13    LN:114364328
@SQ SN:chr14    LN:107043718
@SQ SN:chr15    LN:101991189
@SQ SN:chr16    LN:90338345
@SQ SN:chr17    LN:83257441
@SQ SN:chr18    LN:80373285
@SQ SN:chr19    LN:58617616
@SQ SN:chr20    LN:64444167
@SQ SN:chr21    LN:46709983
@SQ SN:chr22    LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@SQ SN:GL000008.2   LN:209709
@SQ SN:GL000009.2   LN:201709
@SQ SN:GL000194.1   LN:191469
@SQ SN:GL000195.1   LN:182896
@SQ SN:GL000205.2   LN:185591
@SQ SN:GL000208.1   LN:92689
@SQ SN:GL000213.1   LN:164239
@SQ SN:GL000214.1   LN:137718
@SQ SN:GL000216.2   LN:176608
@SQ SN:GL000218.1   LN:161147
@SQ SN:GL000219.1   LN:179198
@SQ SN:GL000220.1   LN:161802
@SQ SN:GL000221.1   LN:155397
@SQ SN:GL000224.1   LN:179693
@SQ SN:GL000225.1   LN:211173
@SQ SN:GL000226.1   LN:15008
@SQ SN:KI270302.1   LN:2274
@SQ SN:KI270303.1   LN:1942
@SQ SN:KI270304.1   LN:2165
@SQ SN:KI270305.1   LN:1472
@SQ SN:KI270310.1   LN:1201
@SQ SN:KI270311.1   LN:12399
@SQ SN:KI270312.1   LN:998
@SQ SN:KI270315.1   LN:2276
@SQ SN:KI270316.1   LN:1444
@SQ SN:KI270317.1   LN:37690
@SQ SN:KI270320.1   LN:4416
@SQ SN:KI270322.1   LN:21476
@SQ SN:KI270329.1   LN:1040
@SQ SN:KI270330.1   LN:1652
@SQ SN:KI270333.1   LN:2699
@SQ SN:KI270334.1   LN:1368
@SQ SN:KI270335.1   LN:1048
@SQ SN:KI270336.1   LN:1026
@SQ SN:KI270337.1   LN:1121
@SQ SN:KI270338.1   LN:1428
@SQ SN:KI270340.1   LN:1428
@SQ SN:KI270362.1   LN:3530
@SQ SN:KI270363.1   LN:1803
@SQ SN:KI270364.1   LN:2855
@SQ SN:KI270366.1   LN:8320
@SQ SN:KI270371.1   LN:2805
@SQ SN:KI270372.1   LN:1650
@SQ SN:KI270373.1   LN:1451
@SQ SN:KI270374.1   LN:2656
@SQ SN:KI270375.1   LN:2378
@SQ SN:KI270376.1   LN:1136
@SQ SN:KI270378.1   LN:1048
@SQ SN:KI270379.1   LN:1045
@SQ SN:KI270381.1   LN:1930
@SQ SN:KI270382.1   LN:4215
@SQ SN:KI270383.1   LN:1750
@SQ SN:KI270384.1   LN:1658
@SQ SN:KI270385.1   LN:990
@SQ SN:KI270386.1   LN:1788
@SQ SN:KI270387.1   LN:1537
@SQ SN:KI270388.1   LN:1216
@SQ SN:KI270389.1   LN:1298
@SQ SN:KI270390.1   LN:2387
@SQ SN:KI270391.1   LN:1484
@SQ SN:KI270392.1   LN:971
@SQ SN:KI270393.1   LN:1308
@SQ SN:KI270394.1   LN:970
@SQ SN:KI270395.1   LN:1143
@SQ SN:KI270396.1   LN:1880
@SQ SN:KI270411.1   LN:2646
@SQ SN:KI270412.1   LN:1179
@SQ SN:KI270414.1   LN:2489
@SQ SN:KI270417.1   LN:2043
@SQ SN:KI270418.1   LN:2145
@SQ SN:KI270419.1   LN:1029
@SQ SN:KI270420.1   LN:2321
@SQ SN:KI270422.1   LN:1445
@SQ SN:KI270423.1   LN:981
@SQ SN:KI270424.1   LN:2140
@SQ SN:KI270425.1   LN:1884
@SQ SN:KI270429.1   LN:1361
@SQ SN:KI270435.1   LN:92983
@SQ SN:KI270438.1   LN:112505
@SQ SN:KI270442.1   LN:392061
@SQ SN:KI270448.1   LN:7992
@SQ SN:KI270465.1   LN:1774
@SQ SN:KI270466.1   LN:1233
@SQ SN:KI270467.1   LN:3920
@SQ SN:KI270468.1   LN:4055
@SQ SN:KI270507.1   LN:5353
@SQ SN:KI270508.1   LN:1951
@SQ SN:KI270509.1   LN:2318
@SQ SN:KI270510.1   LN:2415
@SQ SN:KI270511.1   LN:8127
@SQ SN:KI270512.1   LN:22689
@SQ SN:KI270515.1   LN:6361
@SQ SN:KI270516.1   LN:1300
@SQ SN:KI270517.1   LN:3253
@SQ SN:KI270518.1   LN:2186
@SQ SN:KI270519.1   LN:138126
@SQ SN:KI270521.1   LN:7642
@SQ SN:KI270522.1   LN:5674
@SQ SN:KI270528.1   LN:2983
@SQ SN:KI270529.1   LN:1899
@SQ SN:KI270530.1   LN:2168
@SQ SN:KI270538.1   LN:91309
@SQ SN:KI270539.1   LN:993
@SQ SN:KI270544.1   LN:1202
@SQ SN:KI270548.1   LN:1599
@SQ SN:KI270579.1   LN:31033
@SQ SN:KI270580.1   LN:1553
@SQ SN:KI270581.1   LN:7046
@SQ SN:KI270582.1   LN:6504
@SQ SN:KI270583.1   LN:1400
@SQ SN:KI270584.1   LN:4513
@SQ SN:KI270587.1   LN:2969
@SQ SN:KI270588.1   LN:6158
@SQ SN:KI270589.1   LN:44474
@SQ SN:KI270590.1   LN:4685
@SQ SN:KI270591.1   LN:5796
@SQ SN:KI270593.1   LN:3041
@SQ SN:KI270706.1   LN:175055
@SQ SN:KI270707.1   LN:32032
@SQ SN:KI270708.1   LN:127682
@SQ SN:KI270709.1   LN:66860
@SQ SN:KI270710.1   LN:40176
@SQ SN:KI270711.1   LN:42210
@SQ SN:KI270712.1   LN:176043
@SQ SN:KI270713.1   LN:40745
@SQ SN:KI270714.1   LN:41717
@SQ SN:KI270715.1   LN:161471
@SQ SN:KI270716.1   LN:153799
@SQ SN:KI270717.1   LN:40062
@SQ SN:KI270718.1   LN:38054
@SQ SN:KI270719.1   LN:176845
@SQ SN:KI270720.1   LN:39050
@SQ SN:KI270721.1   LN:100316
@SQ SN:KI270722.1   LN:194050
@SQ SN:KI270723.1   LN:38115
@SQ SN:KI270724.1   LN:39555
@SQ SN:KI270725.1   LN:172810
@SQ SN:KI270726.1   LN:43739
@SQ SN:KI270727.1   LN:448248
@SQ SN:KI270728.1   LN:1872759
@SQ SN:KI270729.1   LN:280839
@SQ SN:KI270730.1   LN:112551
@SQ SN:KI270731.1   LN:150754
@SQ SN:KI270732.1   LN:41543
@SQ SN:KI270733.1   LN:179772
@SQ SN:KI270734.1   LN:165050
@SQ SN:KI270735.1   LN:42811
@SQ SN:KI270736.1   LN:181920
@SQ SN:KI270737.1   LN:103838
@SQ SN:KI270738.1   LN:99375
@SQ SN:KI270739.1   LN:73985
@SQ SN:KI270740.1   LN:37240
@SQ SN:KI270741.1   LN:157432
@SQ SN:KI270742.1   LN:186739
@SQ SN:KI270743.1   LN:210658
@SQ SN:KI270744.1   LN:168472
@SQ SN:KI270745.1   LN:41891
@SQ SN:KI270746.1   LN:66486
@SQ SN:KI270747.1   LN:198735
@SQ SN:KI270748.1   LN:93321
@SQ SN:KI270749.1   LN:158759
@SQ SN:KI270750.1   LN:148850
@SQ SN:KI270751.1   LN:150742
@SQ SN:KI270752.1   LN:27745
@SQ SN:KI270753.1   LN:62944
@SQ SN:KI270754.1   LN:40191
@SQ SN:KI270755.1   LN:36723
@SQ SN:KI270756.1   LN:79590
@SQ SN:KI270757.1   LN:71251
@PG ID:minimap2 PN:minimap2 VN:2.24-r1122   CL:minimap2 -t 40 --MD -ax splice:hq -uf /projects/b1073/pipelines/commonref/GRCh38/GRCh38.primary_assembly.genome.fa /projects/b1073/RNAseq/InHouse_LR/IsoSeq_Pipeline/D60_brain_organoid/isoseq3_output/refine/m64403e_230116_171857.flnc.fasta

and here is a SAM record of one of the problematic alignments for one of the lines that errored with getFASTA:

# Line 31475 that errored with getFASTA
chr14   52699325    52716208    m64403e_230116_171857/90505969/ccs  60  +   52699325    52716208    255,0,0 3   78,0,381    0,404,16502

# That same read in the mapped SAM file (note: it also mapped to 2 other locations, one other location on chr14 and the other on chr7)
m64403e_230116_171857/90505969/ccs  2048    chr14   52699326    60  2511H39M1I9M4D26M326N237I16098N3D378M   *   0   0   CTTGTACACTACTGGTGGGAATGTAAATTAGCACAGCCATTTTTGAAAACATGGAGGTTCCTCAAAAAACTAAAAATAGAATTACCATATGATTCAGCAATCCCACTTCTGGGTTTATATCTAAAGGAATTGAAATCAGTGTGTCAGAGATAGCTGCACTCCCATGATTATTTCACAATAGCCAAGATATAGAAACAGCCTAAAAATTGCCCATCAATGGATGAATGGATAAAGAAAATGTGGTAGCCGGGTGCAGTGGCTCATACCTGTAGTGCCAACACTTTGGGAGGCCGAGGCGGGCGGATCACCTGAGGTCTTGAATTCCTGGGCTCAAGCAACCCGCCCACCTCAGCCTCCCAAAGTGCTGGGATTACAGGCATGAGCCACAATGTCCAGCCACGGCAGCTTTCTAATATATTAATACTTAAAGACTTTTCTGATGAGATAAGTGGTGAGAATAACAAAAATTTTTTATAATGTGTGGTGGAAAATGTCAACATTTGGAAGATTTGCATAACTCAACCAGTAGTTTCCAAATAATCAATGCTTGATATTAAAATATTCATAAGTAAAAGATCCAGTCAGTGCACAGGATAGACCAATGTATTTTAATGTAACAGAAGTTTCTGTCATAGTCCATGTTGTAAGTAGATAGCTATTATAAAAAAGACAAAAGTGTTTGCAAGATGT  *   NM:i:249    ms:i:395    AS:i:334    nn:i:0  ts:A:+  tp:A:P  cm:i:112    s1:i:358    s2:i:46 de:f:0.0175 SA:Z:chr7,47964109,-,1013S2187M13320D1S,60,30;chr14,52715721,+,2004S823M374S,60,0;  MD:Z:10G0T19T16^ATAG0T25^GCT378 rl:i:499

Thank you, Mia

This example was very helpful. I just pushed a fix for the bug! If you clone this repository and compile from that, you should have a fix, but please let me know if you still have problems. https://github.com/arq5x/bedtools2

Hi Aaron,

Sorry for the late reply. I just tested the updated bamToBed function on the split chr14 BAM file and then ran getFASTA, and it worked without any errors! Here is output of the previously noted chr14 BED file line:

chr14   52699325    52716208    m64403e_230116_171857/90505969/ccs  60  +   52699325    52716208    255,0,0 2   78,381  0,16502

Now there are no exon blocks of zero length produced.

Thank you very much for all your help and time! I greatly appreciate it!

Best regards, Mia

arq5x / bedtools

bedtools getFASTA is not converting a whole BED file #169