BioJulia / BED.jl

MIT License

Last Bed entry not being read #25

Open abhinavsns opened 1 month ago

abhinavsns commented 1 month ago

On the develop branch (commit 1b51e2d3e6ece2e5adf1ac7274246ee57fa9a81a):

Consider an example.bed file with one entry on a single line:

chr1 1 5

Note that there is no empty line at the end, i.e. the file does not end with a newline.

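using BED, GenomicFeatures  # IntervalCollection comes from GenomicFeatures

# Read every record into an IntervalCollection; `true` requests sorting.
regions = open(BED.Reader, "example.bed") do reader
    IntervalCollection(reader, true)
end

collect(regions)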

Output is an empty interval collection:

IntervalCollection{BED.Record} with 0 intervals:
Interval{BED.Record}[]

Whereas if example.bed ends with a newline, so that an editor shows an empty second line (1 entry, 2 lines):

chr1 1 5

Then the output is as expected:

IntervalCollection{BED.Record} with 1 intervals:
  chr1:2-5  .  chr1     1       5

1-element Vector{Interval{BED.Record}}:
 Interval{BED.Record}:
  sequence name: chr1
  leftmost position: 2
  rightmost position: 5
  strand: .
  metadata: chr1        1       5

It seems that the last entry of a file is not read when the file does not end with a newline.

jonathanBieler commented 1 month ago

As I understand it, that's part of the specification:

line: String terminated by a line separator, in one of the following classes. Either a data line, a comment line, or a blank line. Discussed more fully in subsection 1.4.

line separator: Either carriage return (\r, equivalent to \x0d), newline (\n, equivalent to \x0a), or carriage return followed by newline (\r\n, equivalent to \x0d\x0a). The same line separator must be used throughout the file.

https://samtools.github.io/hts-specs/BEDv1.pdf
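
To make the two quoted rules concrete, here is a minimal Python sketch that checks a file's raw bytes against them: every line must end with a line separator, and only one kind of separator may appear. The function name check_line_separators is hypothetical and is not part of BED.jl or any other BED library; it only illustrates the definitions quoted above.

```python
import re

# The three separators the quoted definition allows. "\r\n" is tried first
# so that a Windows line ending counts once, not as "\r" plus "\n".
SEPARATOR = re.compile(rb"\r\n|\r|\n")


def check_line_separators(data: bytes) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    separators = SEPARATOR.findall(data)
    # Rule 1: a line is a string *terminated* by a line separator.
    if data and not data.endswith((b"\r", b"\n")):
        problems.append("last line is not terminated by a line separator")
    # Rule 2: the same separator must be used throughout the file.
    if len(set(separators)) > 1:
        problems.append("more than one kind of line separator is used")
    return problems


print(check_line_separators(b"chr1\t1\t5"))    # the file from the report
print(check_line_separators(b"chr1\t1\t5\n"))  # the same file with a newline
```

Under these assumptions, the one-line file from the report fails the first check, and the version with a trailing newline passes both.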

How do Python and R handle it?
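
For the Python half of that question, a small sketch using only the standard library: iterating over a text stream yields the final line even when it has no trailing newline. The helper read_records is hypothetical and stands in for a naive line-based parser; it says nothing about how a dedicated BED package for Python or R behaves.

```python
import io

unterminated = "chr1\t1\t5"    # one record, no trailing newline
terminated = "chr1\t1\t5\n"    # the same record, newline-terminated


def read_records(text: str) -> list[list[str]]:
    """Split each non-blank line into its tab-separated fields."""
    records = []
    for line in io.StringIO(text):   # iterates line by line, like an open text file
        line = line.rstrip("\r\n")   # drop the separator if there is one
        if line:                     # skip blank lines
            records.append(line.split("\t"))
    return records


print(read_records(unterminated))
print(read_records(terminated))
```

Both calls return the same single record, so plain line iteration in Python is more lenient than the strict reading of the specification.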

abhinavsns commented 1 month ago

I do not know how Python or R handles it, but I see that the standard does indeed require every line, including the last, to end with a line separator. That seems counterintuitive. Are there any non-trivial advantages to such a design?