cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

More false negatives #103

Open bernt-matthias opened 2 days ago

bernt-matthias commented 2 days ago

I was experimenting a bit with puremagic. Unfortunately, the first two tests already failed (while file did its job). GRIB might just be missing, but HDF5 should be detected, shouldn't it?

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/test/test.mz5

python -m puremagic lib/galaxy/datatypes/test/test.mz5 
'lib/galaxy/datatypes/test/test.mz5' : could not be Identified
file lib/galaxy/datatypes/test/test.mz5
lib/galaxy/datatypes/test/test.mz5: Hierarchical Data Format (version 5) data

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/test/test.grib

python -m puremagic lib/galaxy/datatypes/test/test.grib
'lib/galaxy/datatypes/test/test.grib' : could not be Identified
file lib/galaxy/datatypes/test/test.grib 
Gridded binary (GRIB) version 1

cdgriffith commented 2 days ago

Thanks for reporting, never heard of either of these types before!

For MZ5, I see the standard from NASA is returning a 404: https://earthdata.nasa.gov/esdis/eso/standards-and-references/hdf-eos5. There is also information on it here: https://docs.ogc.org/is/18-043r3/18-043r3.html, but it makes no mention of magic numbers.

Opening the file itself, it starts with ‰HDF, so we can probably use that as a low-accuracy match. Do you have any more examples of these file types I could look through?
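
As a rough sketch of what such a check could look like: the ‰ shown by a text editor is the byte 0x89 rendered as Windows-1252, and the HDF5 format defines an 8-byte signature that begins with it. The function names below are hypothetical and not part of puremagic. The sketch only looks at offset 0, although the HDF5 specification also allows the signature at certain later offsets.

```python
# 8-byte HDF5 format signature: \x89 H D F \r \n \x1a \n
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(header: bytes) -> bool:
    """Return True if the given leading bytes start with the HDF5 signature."""
    return header.startswith(HDF5_SIGNATURE)


def file_has_hdf5_signature(path: str) -> bool:
    """Hypothetical helper: read just enough of the file to check offset 0."""
    with open(path, "rb") as fh:
        return has_hdf5_signature(fh.read(len(HDF5_SIGNATURE)))
```

Since file reports test.mz5 simply as HDF5 data, this signature alone cannot tell an mz5 file apart from a plain .h5 file, which fits the "low accuracy" caveat.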

cdgriffith commented 2 days ago

Pulled down that repo and looked in the folder with the example files. Compared to file, there are 25 file types that puremagic has no matches for, after excluding those that file reports only as ASCII, data, or very short file.

.h5
.model
.biom2
.cool
.grib
.mcool
.vcf
.sam
.loom
.h5ad
.h5mlm
.nii2
.gpr
.npy
.rma6
.cel
.bcf_uncompressed
.mztab2
.parquet
.ptkscmp
.iqtree
.mz5
.fcs
.hdt
.gal
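
A comparison like the one that produced this list could be sketched roughly as follows. This is not the exact script used, only an illustration under a few assumptions: the file command is on PATH, puremagic.from_file raises puremagic.PureError (or ValueError for empty input) when nothing matches, and the helper names are made up for this example.

```python
import subprocess
from pathlib import Path


def is_informative(description: str) -> bool:
    """True if `file` reported more than ASCII, data, or very short file."""
    desc = description.strip()
    return not (
        desc == "data"
        or desc.startswith("ASCII")
        or desc.startswith("very short file")
    )


def file_description(path: Path) -> str:
    # `file --brief` prints only the description, without the file name.
    result = subprocess.run(
        ["file", "--brief", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def puremagic_identifies(path: Path) -> bool:
    import puremagic  # imported lazily so the helpers above work without it

    try:
        puremagic.from_file(str(path))
        return True
    except (puremagic.PureError, ValueError):  # assumed failure modes
        return False


def missing_extensions(folder: str) -> list[str]:
    """Extensions that `file` identifies but puremagic does not."""
    missing = set()
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        if is_informative(file_description(path)) and not puremagic_identifies(path):
            missing.add(path.suffix)
    return sorted(missing)
```

Calling missing_extensions("lib/galaxy/datatypes/test") from a checkout of the galaxy repo should give a list in the same spirit as the one above, though the exact result depends on the installed versions of file and puremagic.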

I will start looking into each of those to see whether they have associated magic numbers that we can add to puremagic.

Thank you for raising this issue, and supplying the great source of example files!