bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Error: malformed BED entry at line xxx. Start Coordinate detected that is < 0. #48

Open minw2828 opened 1 week ago

minw2828 commented 1 week ago

Hello,

Thank you for developing the tool.

I can see my issue is similar to #20, but I don't have patch sequences in my reference genome.

Could you advise what other reason might have caused this error please?

My error message:

raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError:
Command was:
bedtools sort -i straglr_tmp/tmpplaw7wyh.bed
Error message was:
Error: malformed BED entry at line 84281. Start Coordinate detected that is < 0. Exiting.

My reference genome only has chromosomes 1 to 22, X, Y and M.

Many thanks, Min

readmanchiu commented 1 week ago

First, I would check whether the alignment BAM and the genome FASTA you provided to Straglr use the same chromosome naming convention (for example, both without the "chr" prefix). Can you show me the full command, and tell me which version you are using? Also, if you set --tmpdir to a specific directory and run with --debug, we can locate the "malformed" line in the BED file from the line number in the error message.
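Once the temporary BED file is kept on disk, a small script can list the offending records directly. The following is only a minimal sketch: the function name and the file path in the usage note are hypothetical, and it assumes the file is plain tab-separated BED with the start coordinate in the second column.

```python
from typing import Iterator, Tuple


def find_negative_starts(bed_path: str) -> Iterator[Tuple[int, str]]:
    """Yield (1-based line number, raw line) for BED records whose start is < 0."""
    with open(bed_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            # Skip blank lines and anything with fewer than three columns.
            if len(fields) < 3:
                continue
            try:
                start = int(fields[1])
            except ValueError:
                # Skip header-like lines whose second column is not an integer.
                continue
            if start < 0:
                yield line_number, line.rstrip("\n")
```

Usage would look like `for n, rec in find_negative_starts("straglr_tmp/some_file.bed"): print(n, rec)`, where the path is a placeholder for whichever file the error message names. The line numbers it reports should be comparable to the one bedtools prints, assuming both count every line of the file.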

minw2828 commented 4 days ago

Hello @readmanchiu,

Thank you for your quick response.

I split the genome into chunks, each named ~{region_bed}, so that straglr could process them concurrently.

The command that I ran was:

  python /usr/local/bin/straglr.py \
    --regions ~{region_bed} \
    --min_ins_size 3 \
    --nprocs ~{threads} \
    --tmpdir ~{region_name + "_" + pname + "_straglr_tmp"} \
    ~{bam} ~{ref_fasta} ~{region_name + "_" + pname + "_straglr"}

The same command was run for five individuals. Straglr completed successfully for two of them, but the remaining three hit the same error:

The first individual:

Error: malformed BED entry at line 59197. Start Coordinate detected that is < 0. Exiting.

The second individual:

Error: malformed BED entry at line 9899. Start Coordinate detected that is < 0. Exiting.

The third individual:

Error: malformed BED entry at line 58727. Start Coordinate detected that is < 0. Exiting.

Hence, I do not think the error was caused by different chromosome naming conventions.

I am thinking of two possible causes:

  1. Insufficient memory, which often produces odd errors.
  2. The repeats being genotyped might have a start or end coordinate that lies outside the bounds of the chromosome.

Would reason 2 be possible?
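One way to test reason 2 is to compare every region in a chunk's BED file against the chromosome lengths in the reference's .fai index (the file written by `samtools faidx`, whose first two tab-separated columns are sequence name and length). This is a rough sketch with hypothetical function names, not something straglr provides:

```python
from typing import Dict, Iterator, Tuple


def load_chrom_lengths(fai_path: str) -> Dict[str, int]:
    """Read sequence names and lengths from a FASTA index (.fai) file."""
    lengths: Dict[str, int] = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths


def out_of_bounds_regions(
    bed_path: str, chrom_lengths: Dict[str, int]
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, chromosome, reason) for suspicious BED records."""
    with open(bed_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            # Ignore comment and header lines that some BED files carry.
            if line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            if chrom not in chrom_lengths:
                yield line_number, chrom, "chromosome not in reference index"
            elif start < 0 or start > end or end > chrom_lengths[chrom]:
                yield line_number, chrom, "coordinates out of bounds"
```

If this reports nothing for the input regions, the negative start is more likely produced later, in the temporary BED file, than supplied in the regions themselves.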

I am keen to hear your thoughts on this.

Many thanks, Min

ljohansson commented 23 hours ago

I was wondering what the respective lines in the different BED files are?

readmanchiu commented 14 hours ago

A --min_ins_size of 3 is too small. Just a reminder that the unit for --min_ins_size is bp, not copy number. I think some insertions are being picked up near the ends of chromosomes, so negative coordinates are generated when flank sizes are taken into account. I usually use 100 for --min_ins_size, as ONT reads can be quite noisy. I also usually skip centromeres and long repeats/segdups (which can be curated from UCSC annotation tracks) in genome scans by passing their coordinates to --exclude.
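To illustrate the arithmetic behind the negative coordinates, here is a small sketch. It is not straglr's actual code: the function name, the flank size of 2000 bp, and the clamping step are all assumptions made for the example.

```python
from typing import Tuple


def flanked_interval(
    pos: int, flank: int, chrom_length: int, clamp: bool = True
) -> Tuple[int, int]:
    """Return a window of `flank` bp on either side of `pos`.

    With clamp=False the window can extend past the chromosome ends,
    which is how a negative start coordinate can arise.
    """
    start = pos - flank
    end = pos + flank
    if clamp:
        # Keep the window inside [0, chrom_length].
        start = max(0, start)
        end = min(chrom_length, end)
    return start, end


# An insertion 1200 bp from the start of a chromosome, hypothetical 2000 bp flank:
print(flanked_interval(1200, 2000, 248956422, clamp=False))  # (-800, 3200)
print(flanked_interval(1200, 2000, 248956422, clamp=True))   # (0, 3200)
```

An unclamped window like the first one is the kind of record that `bedtools sort` rejects with "Start Coordinate detected that is < 0".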