CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

Composite units #5

Open rich970 opened 2 years ago

rich970 commented 2 years ago

Hi.

I've been running into an issue where the ChemWordTokenizer class splits composite units around the '/'. For example, the speed unit 'm/s' becomes 'm' + '/' + 's'.

I'm not sure whether this is intentional, but it can cause problems when trying to recognise composite units in text.
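
For readers who want to see the reported behaviour in isolation, here is a minimal standalone sketch. It does not use ChemDataExtractor and is not how ChemWordTokenizer is implemented; `toy_tokenize` is a hypothetical stand-in that only mimics the splitting around '/'.

```python
import re


def toy_tokenize(text):
    """Hypothetical tokenizer: split on whitespace, then around each '/'."""
    tokens = []
    for chunk in text.split():
        # The capturing group keeps the slash itself as a separate token.
        tokens.extend(part for part in re.split(r"(/)", chunk) if part)
    return tokens


print(toy_tokenize("a speed of 5 m/s"))
```

With this stand-in, 'm/s' comes out as three tokens ('m', '/', 's'), which is the situation described above.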

It's not clear from the documentation whether a different tokenizer should be used in these circumstances, or whether I should be defining the units differently within the model.

My workaround has been to comment out the exception for forward slashes (lines 713 - 716) in the tokenize module.

All the best, Rich

ti250 commented 2 years ago

Hi Rich,

For composite units like speed, where the unit is a combination of several components and is written as an expression of those components (such as m/s or ms-1), the parse expressions are constructed automatically, so creating a units_dict is unnecessary. The units_dict is only required for a differently named “composite” unit, as is the case with energy, where we write joules rather than kgm2s-2. In the case of speed, we would only need to add to the units_dict if we had some new unit, e.g. if we wanted to write speeds in terms of c. The benefit of doing things this way is that you don't need to think about all the different ways in which people can write units (e.g. m/s, ms-1, km/h), as the system can handle all combinations.

The practical result of this is that in the case of speed, ChemDataExtractor will extract speeds correctly with just the following definition:

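# Import paths assume the chemdataextractor.model.units package layout.
from chemdataextractor.model.units.dimension import Dimension
from chemdataextractor.model.units.length import Length
from chemdataextractor.model.units.time import Time
from chemdataextractor.model.units.quantity_model import QuantityModel

# Speed is derived from existing dimensions, so no units_dict is needed.
class Speed(Dimension):
    constituent_dimensions = Length() / Time()

class SpeedModel(QuantityModel):
    dimensions = Speed()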
I'll update the documentation to make this clearer.

rich970 commented 2 years ago

Thanks - that makes total sense!