Mordred Descriptors failing to obtain data type

GemmaTuron commented 1 month ago

Hi @miquelduranfrigola and @DhanshreeA

Quite urgent. The Mordred descriptors are used in ZairaChem, but when I run the pipeline, it fails to use them. I think the error is somewhere in the model metadata, as this is the output I am getting. It does calculate the descriptor values, but the datatype of the outcome is then resolved as None. Is the issue in the service.py file? Could you have a quick look?

17:48:44 | DEBUG    | Matching for input is [1]
17:48:44 | DEBUG    | Has header True
17:48:44 | DEBUG    | Schema {'input': [1], 'key': None}
17:48:44 | DEBUG    | Standardizing input single
17:48:44 | DEBUG    | Writing standardized input to /tmp/ersilia-inn6sgmf/standard_input_file.csv
17:48:44 | DEBUG    | Reading standard file from /tmp/ersilia-inn6sgmf/standard_input_file.csv
17:48:44 | DEBUG    | Schema available in /home/gturon/eos/dest/eos78ao/api_schema.json
17:48:58 | DEBUG    | Status code: 200
17:48:58 | DEBUG    | Schema available in /home/gturon/eos/dest/eos78ao/api_schema.json
17:49:12 | DEBUG    | Status code: 200
17:49:25 | DEBUG    | Status code: 200
17:49:39 | DEBUG    | Status code: 200
17:49:49 | DEBUG    | Status code: 200
17:49:49 | DEBUG    | Done with unique posting
17:49:50 | DEBUG    | Data: outcome
17:49:50 | DEBUG    | Values: [16.832930485447708, 15.132282677982634, 0.0, 0.0, 26.694651226242218, 2.535927912857621, 5.07185582
17:49:50 | DEBUG    | Getting pure dtype for outcome
17:49:50 | DEBUG    | This is the pure datatype: None
17:49:51 | DEBUG    | Guessed pure datatype: None
17:49:51 | DEBUG    | Guessed absent pure datatype: None
17:49:51 | DEBUG    | Datatype: None
17:49:51 | DEBUG    | Guessed pure datatype: None
17:49:51 | DEBUG    | [16.832930485447708, 15.132282677982634, 0.0, 0.0, 26.694651226242218, 2.535927912857621, 5.071855825715241, 26.694651226242218 ... ]
17:49:51 | DEBUG    | None
17:49:51 | DEBUG    | outcome

17:49:52 | DEBUG    | None
17:49:52 | DEBUG    | outcome

FYI I am using the Docker version of the model

GemmaTuron commented 1 month ago

oh no @miquelduranfrigola

The model is only having this behaviour when run inside ZairaChem. Using the Ersilia installed in the same environment but as a standalone:

18:10:00 | DEBUG    | Data: outcome
18:10:00 | DEBUG    | Values: [26.605158585931452, 21.16339214179242, 3.0, 1.0, 42.87062313866525, 2.572597677509152, 5.1445266340
18:10:00 | DEBUG    | Getting pure dtype for outcome
18:10:00 | DEBUG    | This is the pure datatype: numeric_array
18:10:00 | DEBUG    | Datatype: numeric_array
18:10:00 | DEBUG    | Datatype has been matched: numeric_array over {'mixed_array', 'array', 'numeric_array', 'string_array'}
18:10:00 | DEBUG    | No merge key
18:10:00 | DEBUG    | [26.605158585931452, 21.16339214179242, 3.0, 1.0, 42.87062313866525, 2.572597677509152, 5.1445266340153015, ...]
18:10:00 | DEBUG    | numeric_array
18:10:00 | DEBUG    | outcome

GemmaTuron commented 1 month ago

Ersilia is v0.1.34 and cannot be upgraded btw

GemmaTuron commented 1 month ago

Could it be because we are using the Ersilia API instead of the CLI in ZairaChem? I'll test it

GemmaTuron commented 1 month ago

FYI @DhanshreeA and @Abellegese, this issue does not happen when using the Ersilia Python API as a standalone (outside ZairaChem). I really do not understand what is going on, but it seems this was already reported and fixed. @DhanshreeA, can you confirm what the issue was and how it was fixed?

from ersilia import ErsiliaModel
em = ErsiliaModel("eos78ao")
em.api(input="test.csv", output="out.csv")
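As a quick sanity check on the standalone run, the resulting out.csv could be scanned for missing cells. This is only a minimal sketch: it assumes the file has a header row, and the helper name `count_missing_cells` and the set of strings treated as missing are my own choices, not part of Ersilia.

```python
import csv

# Strings treated as "missing" (an assumption; adjust to the actual output).
MISSING = {"", "nan", "none", "null"}


def count_missing_cells(path):
    """Return {column_name: number_of_missing_cells} for a CSV with a header row."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            for column, cell in row.items():
                if column is None:
                    # Extra cells beyond the header are ignored.
                    continue
                text = (cell or "").strip().lower()
                counts[column] = counts.get(column, 0) + (1 if text in MISSING else 0)
    return counts


if __name__ == "__main__":
    # "out.csv" is the output file written by the em.api call above.
    for column, missing in count_missing_cells("out.csv").items():
        if missing:
            print(column, missing)
```

Running the same check on the output produced inside ZairaChem would show whether the two runs differ in how many values are missing.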

Abellegese commented 1 month ago

Hi @GemmaTuron, I will take a look and let you know.

GemmaTuron commented 1 month ago

@miquelduranfrigola quick question as well. Do you think the Mordred descriptors are not being processed because the datatype is identified as None, or is there another reason? The raw.h5 file IS created, but then the pipeline breaks.

I would not link one issue to the other:

@DhanshreeA please confirm the None error is fine in other versions of Ersilia.

@Abellegese just add the Python API tests as we discussed, but do not lose too much time investigating this.

@miquelduranfrigola you and I should do a deep dive in ZairaChem soon and fix those issues.

miquelduranfrigola commented 1 month ago

Thanks @GemmaTuron, @DhanshreeA and @Abellegese. My immediate reaction would be that we work on making ZairaChem compatible with the latest Ersilia version (if it isn't yet), so we can at least reflect the changes we make in Ersilia in ZairaChem. Also, the None issue with the metadata was usually just fine, and it was resolved dynamically by inspecting the data (generally; I am not sure about this case in particular). To me, what is happening is that Mordred is giving too many NaN values and then the data type resolver fails. On a possibly related note, I have noticed that Mordred tends to give more NaN values with the latest NumPy versions, which is absolutely critical. So, can you confirm which NumPy version is being used to run Mordred? That is, in the eos78ao conda environment.
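To make both checks concrete, here is a minimal sketch that reports the installed NumPy version and measures how many values in an outcome vector are missing. The `guess_dtype` function is a hypothetical stand-in that only illustrates how a resolver can end up returning None when nothing usable is left. It is not Ersilia's actual implementation, and it just reuses the datatype labels that appear in the logs above. It assumes Python 3.8 or later for `importlib.metadata`.

```python
import math
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nan_fraction(values):
    """Fraction of entries that are None or NaN (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(is_missing(v) for v in values) / len(values)


def guess_dtype(values):
    """Hypothetical resolver (NOT Ersilia's): infer a label from non-missing values."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return None  # nothing left to inspect, so no datatype can be guessed
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric_array"
    if all(isinstance(v, str) for v in present):
        return "string_array"
    return "mixed_array"


if __name__ == "__main__":
    print("numpy:", installed_version("numpy"))
    # Illustrative values only; the NaN is added here to exercise the check.
    sample = [16.832930485447708, 15.132282677982634, 0.0, float("nan")]
    print(nan_fraction(sample), guess_dtype(sample))
```

Run inside the eos78ao conda environment, the first printed line answers the NumPy version question, and `nan_fraction` applied to the Values list from the logs would show whether NaN values are plausibly the cause.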

ersilia-os / eos78ao

Mordred Descriptors failing to obtain data type #12