alan-turing-institute / empiarreader

Reader for EMPIAR datasets
BSD 3-Clause "New" or "Revised" License
8 stars 4 forks source link

Some EMPIAR entries break #22

Closed ots22 closed 1 year ago

ots22 commented 1 year ago

See #20.

Accessing EMPIAR entry 10340 with EmpiarSource appears not to load any entries.

ds = EmpiarSource(
        10340,
        directory="data/Micrographs/Case1",
        filename_regexp="FoilHole_21756216_Data_21768957_21768958_20181221_2054-115048\\.mrc",
    )

ds.read_partition(0)

Raises StopIteration followed by IndexError (indicating the datasource is empty).

(The . in the regular expression is escaped as \\. so it isn't interpreted as 'any character' - but makes no difference to the result)

@JatGreer is this what you observed?

JatGreer commented 1 year ago

@ots22 sorry for the delay on this, yes that's exactly what observed. And thanks - I somehow missed that regex is required in that field! Initially I just tried changing . to \\. as it was the only regex-critical I thought was in there but that still failed in the same way. After copying the regex from the test with 10943 (ie: using filename_regexp=".*115048\\.mrc") the test succeeds instead. So looks like I'm just bad at regex!

I suspect a lot of the intended/potential userbase aren't regex experts though, so I'd like to float an idea:

If possible, I believe this would make empiarreader more accessible to those new to coding/python. Also, it might be nice to throw a more easily interpretable error upon this failure, but I'm unsure of the optimal way to implement that. Let me know your thoughts!

ots22 commented 1 year ago

Ah good spot. Apparently the full matching filename is https://ftp.ebi.ac.uk/empiar/world_availability/10340/data/Micrographs/Case1/FoilHole_21756216_Data_21768957_21768958_20181221_2054-115048.mrc, which is why the .* is needed. Full path could be useful for matching across subdirectories, but the common case is probably just the basename.

Agreed about having the option to have a mode for string matching (or simple wildcards), and the error message. I think read_partition(0) might also be a bit too low-level for typical user facing code anyway, but a full read() is expensive.