Closed mbroad1 closed 8 months ago
Hi Mia,
Thanks so much for reporting this. Would it be possible to share the FASTA files and the command line invocation for each of these problematic lines? It looks like they each use a different input FASTA?
All the best, Aaron
I see those are the output file names now. Could you share your input genome file?
Hi Aaron,
Thank you for your quick response! For my FASTA file, I am using the hg38 genome without contigs.
Here is the full script for reference:
#!/bin/bash
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH -n 24
#SBATCH --job-name=bed2fasta
#SBATCH -o bed2fasta_%j.log
#SBATCH -e bed2fasta_%j.log
#SBATCH --mail-type=ALL,TIME_LIMIT_80,TIME_LIMIT
#SBATCH -t 48:00:00
#SBATCH --mem=10gb
### Script: bed2fasta.sh
### Usage: sbatch bed2fasta.sh
### Purpose: Convert BED files to FASTA format for each chromosome.
### Created: 3/4/2024
# Modules
module purge all
module load bedtools/2.30.0
# BED file directory
inputDir="/projects/b1073/RNAseq/InHouse_LR/IsoSeq_Pipeline/organoid/D60_brain_organoid/TAMA_ORF_NMD_pipeline/split_BEDs"
cd "$inputDir"
# Output directory
outputDir="/projects/b1073/RNAseq/InHouse_LR/IsoSeq_Pipeline/organoid/D60_brain_organoid/TAMA_ORF_NMD_pipeline/temp_files"
[ ! -d "$outputDir" ] && mkdir -p "$outputDir"
# Directory for temporary BED files
BED_tempDir="${outputDir}/temp_beds"
[ ! -d "$BED_tempDir" ] && mkdir -p "$BED_tempDir"
# Directory for temporary FASTA files
FASTA_tempDir="${outputDir}/temp_fasta"
[ ! -d "$FASTA_tempDir" ] && mkdir -p "$FASTA_tempDir"
# hg38 genome
genome="/projects/b1073/pipelines/commonref/GRCh38/GRCh38_NoContigs.primary_assembly.genome.fa"
# BED files
bed_files=($(ls *.bed))
for bed_file in "${bed_files[@]}"; do
line_number=1
while IFS=$'\t' read -r chrom start end name score strand thickStart thickEnd itemRgb blockCount blockSizes blockStarts; do
# Check if start < end (valid coordinates)
if [ "$start" -ge "$end" ]; then
echo "Error: Invalid coordinates at line $line_number:"
echo "Chrom: $chrom, Start: $start, End: $end"
continue
fi
# Check if the coordinates are valid
if [ "$start" -ge 0 ] && [ "$end" -ge 0 ]; then
echo "Processing line $line_number..."
chr=${bed_file##*_}
chr=${chr%.bed}
temp_bed="$BED_tempDir/D60_brain_organoid_${chr}_line${line_number}.bed"
echo -e "$chrom\t$start\t$end\t$name\t$score\t$strand\t$thickStart\t$thickEnd\t$itemRgb\t$blockCount\t$blockSizes\t$blockStarts" > "$temp_bed"
# Run bedtools getfasta for each temp_bed12 and handle errors
if bedtools getfasta -s -split -name -fi "$genome" -bed "$temp_bed" -fo "$FASTA_tempDir/D60_brain_organoid_${chr}_line${line_number}.fa" 2>&1; then
echo "D60_brain_organoid_${chr}_line${line_number}.fa successfully created"
else
echo "Error: Couldn't create D60_brain_organoid_${chr}_line${line_number}.fa"
fi
else
echo "Error: Invalid coordinates at line $line_number: Start or end coordinate is negative."
fi
((line_number++))
done < "$bed_file"
done
For some more context, I initially convert a long-read RNA-seq BAM file into BED12 files using bedtools bam2bed and each of the BED12 files is split by chromosome, resulting in 24 separate BED12 files. For each of the BED12 files, I iterate over them and for each line of the BED12 files, I create a temporary BED12 file of that single line and check if the coordinates of that line are valid or not, and if they are, it'll convert that BED12 file into a FASTA file with that single line using getFASTA. I did this originally thinking I could concatenate all the temporary BED12 and temporary FASTA files into single files, but that took forever, which led me to do the following: once I find the lines that create the error I first mentioned, I get rid of them using awk, concatenate all the BED12 files into a single BED12 file, and am then able to use getFASTA on the one BED12 file with no issue.
The hg38 FASTA file is too large to upload here, but I'm happy to send it to you in another way if you would still like it.
Thank you, Mia
Hi Mia, the issue is that the 11th column (block sizes: https://genome.ucsc.edu/FAQ/FAQformat.html#format1) in each of these 3 records has a length of 0 as the second length (e.g., "132,0,286"). Bedtools rejects this because a block such as an exon should not have zero length if it is a block.
If I change that 0 to something non-zero (e.g., "132,1,286"), it works:
cat mia.2.bed
chr12 111894105 111903903 m64403e_230116_171857/105318615/ccs 18 + 111894105 111903903 255,0,0 3 132,1,286 0,1620,9512
bedtools getfasta -s -split -name -fi chr12.fa -bed mia.2.bed
>m64403e_230116_171857/105318615/ccs::chr12:111894105-111903903(+)
gatctcggctcactgcaacctctgtctcccaggttcaagtgattctcctgcctcagcctcccaagtagttgggattacaggcacccgccaccatgcccggctaatttttgtatttttagtagagacagggtttgttgcagtgagctatgataccaccacagtactctagcctgggcgacagagcaagaccctgtctcagataaataaaaaCcattcattctgttctaggcactgtgttaaagcaatactttacattcattgccttaatttaatcctcccaacttcccaataaaacctccatcttgtacatgaggaaacagagactgaagagtttgctctagagctagctagtaagtggtgggacctgcatccctccccagaaatgagtgaatctactgcctcatgaaccttttccTTCC
Is it possible that this use of a zero-length block is a bug in some upstream software?
Also, the main bedtools repository is (confusingly) here (bedtools2): https://github.com/arq5x/bedtools2. In the future, please file issues there. Thanks!
Hi Aaron,
I didn't notice that the exon was 0 length in the 11th column; thank you so much for finding that error! I used bedtools bamtobed to convert my BAM files into BED12 format. Here is the part of my script where I used it:
# Modules
module purge all
module load bedtools/2.30.0
module load samtools/1.14
# Chromosome array
chrom=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY")
# Split BAM by chromosome and create BED12 files
for chr in "${chrom[@]}"
do
echo "Splitting bam into ${chr}.bam..."
bam_file_name=$(basename ${bam} .sort.bam)
split_bam="${BAM_outputDir}/${bam_file_name}_${chr}.bam"
split_bam_sorted="$BAM_outputDir/${bam_file_name}_${chr}.sort.bam"
samtools view -@ 24 -h -b "$bam" "$chr" > "$split_bam"
samtools sort -@ 24 "$split_bam" -o "$split_bam_sorted"
samtools index "$split_bam_sorted"
echo "Creating BED12 file for ${split_bam_sorted}..."
output_bam="${BED_outputDir}/D60_brain_organoid_${chr}.bed"
bedtools bamtobed -i "$split_bam_sorted" -bed12 -split -tag > "$output_bam"
done
The BAM file that is split by chromosome is from PacBio IsoSeq data, and the BAM is preprocessed to the point of full-length non-concatemer (FLNC) reads. I've also ran bedtools bamtobed on a different BAM file and then filtered the BED12 file for only reads that supported GENCODE annotated transcripts, and I also ran into the same issue afterwards with bedtools getFASTA.
If you prefer, I can continue this conversation (aka copy and paste this issue) in the github repository you linked if that is better; otherwise, I will post issues in the future there.
Thank you for all your help!
Best regards, Mia
Hmm, I wonder if there is a bug in bamtobed for this. Could you send me the head of the SAM file, as well as the SAM record for one of the problematic alignments and I will have a look?
Hi Aaron,
Here is the header of the SAM file:
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@SQ SN:GL000008.2 LN:209709
@SQ SN:GL000009.2 LN:201709
@SQ SN:GL000194.1 LN:191469
@SQ SN:GL000195.1 LN:182896
@SQ SN:GL000205.2 LN:185591
@SQ SN:GL000208.1 LN:92689
@SQ SN:GL000213.1 LN:164239
@SQ SN:GL000214.1 LN:137718
@SQ SN:GL000216.2 LN:176608
@SQ SN:GL000218.1 LN:161147
@SQ SN:GL000219.1 LN:179198
@SQ SN:GL000220.1 LN:161802
@SQ SN:GL000221.1 LN:155397
@SQ SN:GL000224.1 LN:179693
@SQ SN:GL000225.1 LN:211173
@SQ SN:GL000226.1 LN:15008
@SQ SN:KI270302.1 LN:2274
@SQ SN:KI270303.1 LN:1942
@SQ SN:KI270304.1 LN:2165
@SQ SN:KI270305.1 LN:1472
@SQ SN:KI270310.1 LN:1201
@SQ SN:KI270311.1 LN:12399
@SQ SN:KI270312.1 LN:998
@SQ SN:KI270315.1 LN:2276
@SQ SN:KI270316.1 LN:1444
@SQ SN:KI270317.1 LN:37690
@SQ SN:KI270320.1 LN:4416
@SQ SN:KI270322.1 LN:21476
@SQ SN:KI270329.1 LN:1040
@SQ SN:KI270330.1 LN:1652
@SQ SN:KI270333.1 LN:2699
@SQ SN:KI270334.1 LN:1368
@SQ SN:KI270335.1 LN:1048
@SQ SN:KI270336.1 LN:1026
@SQ SN:KI270337.1 LN:1121
@SQ SN:KI270338.1 LN:1428
@SQ SN:KI270340.1 LN:1428
@SQ SN:KI270362.1 LN:3530
@SQ SN:KI270363.1 LN:1803
@SQ SN:KI270364.1 LN:2855
@SQ SN:KI270366.1 LN:8320
@SQ SN:KI270371.1 LN:2805
@SQ SN:KI270372.1 LN:1650
@SQ SN:KI270373.1 LN:1451
@SQ SN:KI270374.1 LN:2656
@SQ SN:KI270375.1 LN:2378
@SQ SN:KI270376.1 LN:1136
@SQ SN:KI270378.1 LN:1048
@SQ SN:KI270379.1 LN:1045
@SQ SN:KI270381.1 LN:1930
@SQ SN:KI270382.1 LN:4215
@SQ SN:KI270383.1 LN:1750
@SQ SN:KI270384.1 LN:1658
@SQ SN:KI270385.1 LN:990
@SQ SN:KI270386.1 LN:1788
@SQ SN:KI270387.1 LN:1537
@SQ SN:KI270388.1 LN:1216
@SQ SN:KI270389.1 LN:1298
@SQ SN:KI270390.1 LN:2387
@SQ SN:KI270391.1 LN:1484
@SQ SN:KI270392.1 LN:971
@SQ SN:KI270393.1 LN:1308
@SQ SN:KI270394.1 LN:970
@SQ SN:KI270395.1 LN:1143
@SQ SN:KI270396.1 LN:1880
@SQ SN:KI270411.1 LN:2646
@SQ SN:KI270412.1 LN:1179
@SQ SN:KI270414.1 LN:2489
@SQ SN:KI270417.1 LN:2043
@SQ SN:KI270418.1 LN:2145
@SQ SN:KI270419.1 LN:1029
@SQ SN:KI270420.1 LN:2321
@SQ SN:KI270422.1 LN:1445
@SQ SN:KI270423.1 LN:981
@SQ SN:KI270424.1 LN:2140
@SQ SN:KI270425.1 LN:1884
@SQ SN:KI270429.1 LN:1361
@SQ SN:KI270435.1 LN:92983
@SQ SN:KI270438.1 LN:112505
@SQ SN:KI270442.1 LN:392061
@SQ SN:KI270448.1 LN:7992
@SQ SN:KI270465.1 LN:1774
@SQ SN:KI270466.1 LN:1233
@SQ SN:KI270467.1 LN:3920
@SQ SN:KI270468.1 LN:4055
@SQ SN:KI270507.1 LN:5353
@SQ SN:KI270508.1 LN:1951
@SQ SN:KI270509.1 LN:2318
@SQ SN:KI270510.1 LN:2415
@SQ SN:KI270511.1 LN:8127
@SQ SN:KI270512.1 LN:22689
@SQ SN:KI270515.1 LN:6361
@SQ SN:KI270516.1 LN:1300
@SQ SN:KI270517.1 LN:3253
@SQ SN:KI270518.1 LN:2186
@SQ SN:KI270519.1 LN:138126
@SQ SN:KI270521.1 LN:7642
@SQ SN:KI270522.1 LN:5674
@SQ SN:KI270528.1 LN:2983
@SQ SN:KI270529.1 LN:1899
@SQ SN:KI270530.1 LN:2168
@SQ SN:KI270538.1 LN:91309
@SQ SN:KI270539.1 LN:993
@SQ SN:KI270544.1 LN:1202
@SQ SN:KI270548.1 LN:1599
@SQ SN:KI270579.1 LN:31033
@SQ SN:KI270580.1 LN:1553
@SQ SN:KI270581.1 LN:7046
@SQ SN:KI270582.1 LN:6504
@SQ SN:KI270583.1 LN:1400
@SQ SN:KI270584.1 LN:4513
@SQ SN:KI270587.1 LN:2969
@SQ SN:KI270588.1 LN:6158
@SQ SN:KI270589.1 LN:44474
@SQ SN:KI270590.1 LN:4685
@SQ SN:KI270591.1 LN:5796
@SQ SN:KI270593.1 LN:3041
@SQ SN:KI270706.1 LN:175055
@SQ SN:KI270707.1 LN:32032
@SQ SN:KI270708.1 LN:127682
@SQ SN:KI270709.1 LN:66860
@SQ SN:KI270710.1 LN:40176
@SQ SN:KI270711.1 LN:42210
@SQ SN:KI270712.1 LN:176043
@SQ SN:KI270713.1 LN:40745
@SQ SN:KI270714.1 LN:41717
@SQ SN:KI270715.1 LN:161471
@SQ SN:KI270716.1 LN:153799
@SQ SN:KI270717.1 LN:40062
@SQ SN:KI270718.1 LN:38054
@SQ SN:KI270719.1 LN:176845
@SQ SN:KI270720.1 LN:39050
@SQ SN:KI270721.1 LN:100316
@SQ SN:KI270722.1 LN:194050
@SQ SN:KI270723.1 LN:38115
@SQ SN:KI270724.1 LN:39555
@SQ SN:KI270725.1 LN:172810
@SQ SN:KI270726.1 LN:43739
@SQ SN:KI270727.1 LN:448248
@SQ SN:KI270728.1 LN:1872759
@SQ SN:KI270729.1 LN:280839
@SQ SN:KI270730.1 LN:112551
@SQ SN:KI270731.1 LN:150754
@SQ SN:KI270732.1 LN:41543
@SQ SN:KI270733.1 LN:179772
@SQ SN:KI270734.1 LN:165050
@SQ SN:KI270735.1 LN:42811
@SQ SN:KI270736.1 LN:181920
@SQ SN:KI270737.1 LN:103838
@SQ SN:KI270738.1 LN:99375
@SQ SN:KI270739.1 LN:73985
@SQ SN:KI270740.1 LN:37240
@SQ SN:KI270741.1 LN:157432
@SQ SN:KI270742.1 LN:186739
@SQ SN:KI270743.1 LN:210658
@SQ SN:KI270744.1 LN:168472
@SQ SN:KI270745.1 LN:41891
@SQ SN:KI270746.1 LN:66486
@SQ SN:KI270747.1 LN:198735
@SQ SN:KI270748.1 LN:93321
@SQ SN:KI270749.1 LN:158759
@SQ SN:KI270750.1 LN:148850
@SQ SN:KI270751.1 LN:150742
@SQ SN:KI270752.1 LN:27745
@SQ SN:KI270753.1 LN:62944
@SQ SN:KI270754.1 LN:40191
@SQ SN:KI270755.1 LN:36723
@SQ SN:KI270756.1 LN:79590
@SQ SN:KI270757.1 LN:71251
@PG ID:minimap2 PN:minimap2 VN:2.24-r1122 CL:minimap2 -t 40 --MD -ax splice:hq -uf /projects/b1073/pipelines/commonref/GRCh38/GRCh38.primary_assembly.genome.fa /projects/b1073/RNAseq/InHouse_LR/IsoSeq_Pipeline/D60_brain_organoid/isoseq3_output/refine/m64403e_230116_171857.flnc.fasta
and here is a SAM record of one of the problematic alignments for one of the lines that errored with getFASTA:
# Line 31475 that errored with getFASTA
chr14 52699325 52716208 m64403e_230116_171857/90505969/ccs 60 + 52699325 52716208 255,0,0 3 78,0,381 0,404,16502
# That same read in the mapped SAM file (note: it also mapped to 2 other locations, one other location on chr14 and the other on chr7)
m64403e_230116_171857/90505969/ccs 2048 chr14 52699326 60 2511H39M1I9M4D26M326N237I16098N3D378M * 0 0 CTTGTACACTACTGGTGGGAATGTAAATTAGCACAGCCATTTTTGAAAACATGGAGGTTCCTCAAAAAACTAAAAATAGAATTACCATATGATTCAGCAATCCCACTTCTGGGTTTATATCTAAAGGAATTGAAATCAGTGTGTCAGAGATAGCTGCACTCCCATGATTATTTCACAATAGCCAAGATATAGAAACAGCCTAAAAATTGCCCATCAATGGATGAATGGATAAAGAAAATGTGGTAGCCGGGTGCAGTGGCTCATACCTGTAGTGCCAACACTTTGGGAGGCCGAGGCGGGCGGATCACCTGAGGTCTTGAATTCCTGGGCTCAAGCAACCCGCCCACCTCAGCCTCCCAAAGTGCTGGGATTACAGGCATGAGCCACAATGTCCAGCCACGGCAGCTTTCTAATATATTAATACTTAAAGACTTTTCTGATGAGATAAGTGGTGAGAATAACAAAAATTTTTTATAATGTGTGGTGGAAAATGTCAACATTTGGAAGATTTGCATAACTCAACCAGTAGTTTCCAAATAATCAATGCTTGATATTAAAATATTCATAAGTAAAAGATCCAGTCAGTGCACAGGATAGACCAATGTATTTTAATGTAACAGAAGTTTCTGTCATAGTCCATGTTGTAAGTAGATAGCTATTATAAAAAAGACAAAAGTGTTTGCAAGATGT * NM:i:249 ms:i:395 AS:i:334 nn:i:0 ts:A:+ tp:A:P cm:i:112 s1:i:358 s2:i:46 de:f:0.0175 SA:Z:chr7,47964109,-,1013S2187M13320D1S,60,30;chr14,52715721,+,2004S823M374S,60,0; MD:Z:10G0T19T16^ATAG0T25^GCT378 rl:i:499
Thank you, Mia
This example was very helpful. I just pushed a fix for the bug! If you clone this repository and compile from that, you should have a fix, but please let me know if you still have problems. https://github.com/arq5x/bedtools2
Hi Aaron,
Sorry for the late reply. I just tested the updated bamToBed function on the split chr14 BAM file and then ran getFASTA, and it worked without any errors! Here is output of the previously noted chr14 BED file line:
chr14 52699325 52716208 m64403e_230116_171857/90505969/ccs 60 + 52699325 52716208 255,0,0 2 78,381 0,16502
Now there are no exon blocks of zero length produced.
Thank you very much for all your help and time! I greatly appreciate it!
Best regards, Mia
Hello,
I have tried using bedtools getFASTA on large BED12 files, but I have noticed that the function does not complete and fails when it hits a line that creates this error:
My interpretation of this error message (and from others online) is that when the start coordinates are greater than the end coordinates, bedtools getFASTA will stop the conversion. To find the lines in the BED12 file that were causing this error, I created the following script to convert each line in the BED12 file line-by-line into a FASTA file:
From this script, I got the following 3 lines out of the total 2875674 that caused this error:
Using awk, I searched for these lines in the original BED12 file and got these lines where the start coordinates do seem to be less than the end coordinates:
Since these coordinates look valid, I was wondering why these lines are causing errors for getFASTA? Also, would it be possible to input some new code to handle this error so that if getFASTA runs into a line that creates this error, it doesn't convert that line, but continues to convert the rest of the file? The line-by-line getFASTA script that I created above took multiple hours to run, so it would be very convenient to have this new error handling since only a few lines out of the almost 3 million lines were unable to convert.
Thank you for your help in advance!
Best regards, Mia