ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0
220 stars 147 forks source link

Output adapter serialize JSON to CSV #560

Closed miquelduranfrigola closed 1 year ago

miquelduranfrigola commented 1 year ago

Hello @carcablop !

Based on previous discussion, it will be good to serialize JSON outputs to CSV. @DhanshreeA has already investigated the issue a little bit. I have assigned it to you because you have expressed interest in it: #371

A while ago, I wrote some code that can help:

  1. https://github.com/ersilia-os/ersilia/blob/master/ersilia/io/dataframe.py
  2. https://github.com/ersilia-os/ersilia/blob/master/ersilia/io/output.py
  3. https://github.com/ersilia-os/ersilia/blob/master/ersilia/utils/csvfile.py
  4. https://github.com/ersilia-os/ersilia/blob/master/ersilia/serve/schema.py

In particular, in (2) there is a DictlistDataframeConverter that can help. In (3), there is a very simple CSV dataloader that we could extend. Finally, in (4) we could consider writting something like def _is_csv_serializable(...) to check whether a given output could be expressed in a CSV (the majority of outputs will be thanks to the JSONL format as discussed)

I would suggest that we organize a 1-1 meeting to talk about this and then we reflect the strategy in this issue thread.

carcablop commented 1 year ago

Hello @miquelduranfrigola. Thank you very much for this information, it is really valuable for me. I'll talk you by slack to organize the meet.

carcablop commented 1 year ago

Hi @GemmaTuron and @miquelduranfrigola

A possible solution that was proposed was the following: Considering JSONL format, it is intended to write all the output of the JSON file in a single column. To make sure that this implementation won't fail because the outputs can be variable with respect to the length of the tree in the JSON. It is proposed to implement a function that can read the model output from the airtable, and if that output is JSON we can implement the solution with the JSONL format. For this, the following must be taken into account:

  1. Predefine the types of outputs, to create a column that can be called output_shape, where the type of output for each model is defined. Which can be:

    • Multiple columns -single -JSON.
  2. Implement the function to read from the airtable the output (the previously created column "output_shape") given a model.

  3. Modify the function to_data_frame() : Depending on the output of the output_shape, if it is JSON, the JSONL format functionality is executed, and for other types of outputs it is executed as it was done before.

carcablop commented 1 year ago

Hello @GemmaTuron and @miquelduranfrigola.

As stated above, I've been working on this problem using the JSONL format to serialize the JSON output to a csv. I was working on the implementation of the function that allows reading the output of the "output shape" field from the airtable, if it is an output shape equal to "Flexible list" (which is the case of the eos526j and eos5qfo models, with JSON output type ), we serialize the output to a string, and return the entire JSON string in a single column of the csv file. I have tested the eos526j model in my Cli, with a .csv type input file with two molecules and specifying the .csv file output. It runs without errors. I attach the output: ouput_eos526j.csv I'm still testing the functionality by passing the entire "eml_canonical.csv" file as input, but this takes a long time. (It's still running.)

I also ran the eos5qfo model, which also presented issues, this model runs on my machine successfully with the changes and I have also been able to have the output with the .csv extension successfully passing the eml_canonical as input. I attach the output for this model: output_eos5qfo_1.csv

As you can see in the output file, let's return all the JSon in a single column named "outcome".

I only have to make sure that the models that have tested for me successfully continue to test successfully in my CLI with the changes before opening a pull request.

carcablop commented 1 year ago

Hi @miquelduranfrigola After trying to make the predictions for a few hours, the model throws the following error, indicating that the connection has been lost. log_predict_eos526j.txt

GemmaTuron commented 1 year ago

Hi @carcablop

This error is due to internet issues, not related to the Hub. have you tried again?

GemmaTuron commented 1 year ago

Hi @carcablop

Following up on this to make sure we close the issue during this week

carcablop commented 1 year ago

Hi @GemmaTuron This model usually takes a long time to run, and possibly it was an internet connection problem, I'm going to run it again.

ana42742 commented 1 year ago

Hi @carcablop! Any updates?

carcablop commented 1 year ago

Hello @GemmaTuron and @miquelduranfrigola. Sorry for my late reply, to test the eos526j model it is required to have a very stable internet connection for a long period. Additionally, I wanted to make sure that it wasn't going to cause errors. Here are my updates:

  1. Testing the eos526j model with the proposed modifications for the output adapter:

The model successfully fetch, serve and predict passing the entire eml_canonical file as input and printing the output in a .csv file. It takes about 6 hours to predict all entries in the eml_canonical.csv file. Log fetch: log_fetch_eos526j.txt

This does not generate errors, you can see in the predictions log that no error was generated, although checking the files that are generated when it is in the process of making the predictions (I have been able to see it in the tmp file on my system), I can see that an error is generated here: eos526j2023-03-22 181959.746412 E tensorfl.txt Regarding this, what I was able to investigate is that it may be because an invalid value is set for this CUDA_VISIBLE_DEVICESenvironment variable so that it does not use the GPU, and this is what may be generating that error. The solution is to change it to CUDA_VISIBLE_DEVICES=0. In the model's main code, I can see that it's set to CUDA_VISIBLE_DEVICE=-1 here: https://github.com/ersilia-os/eos526j/blob/main/model/framework/code/main.py#L8 we can change it to avoid this error. Otherwise, the model runs successfully in my CLI with the changes and the output can be generated in a .csv file Here I attach the output of the model:

output_eos526j.csv

  1. Testing the eos5qfo model with the proposed modifications.

I was able to fetch the model, serve and predict by passing it the entire em_canonical and printing the output to a .csv file. Particularly with this model, I found that when fetching errors like: "input is inconsistent and output is inconsistent", even though the fetch succeeded on my CLI. Here I attach the model fetch log:

log_fetch_eos5qfo.txt

Output .csv salida_eos5qfo.csv

Although the model makes the predictions and prints output in both .json and .csv formats, when it tries to make the predictions you can see errors being generated in the process. An error is thrown with TensorFlow when trying to load the pre-trained model raising an InvalidArgumentError. I attach a log of the errors generated in my temporary files when it makes the predictions: run.log

run.log

I still don't quite understand the error being generated, and if this affects the results in the model's predictions. I tried to download the model and execute it only in the main.py but this generates other errors. If I also try to run this in google colab it also generates the same errors. Passing a single molecule as input, the model does not generate an output, it generates the following error:

assertionError_eos5qfo_1mol.txt

  1. Testing other models with the changes:

Other models with the output shape equal to Flexible list, such as eos4q1a, and this model runs successfully and predicts without errors, both for one molecule and for all the eml_canonical files. Other models with another Output shape can be successfully tested and do not generate errors. For models that are still in progress and that the fields in the output shape airtable are empty, I added a try-except to the function to avoid the error when this field is empty and can be executed as it has been implemented and tested without errors.

The proposed implementation for the output adapter was well done. I tried to do several tests to make sure that this does not generate errors. Let me know if another test needs to be done.

Here are the changes: https://github.com/carcablop/ersilia/commit/05db137224290f370396432c2594f2f6d4f5c3d3

Thanks, I'll do the pull request.

carcablop commented 1 year ago

This is the pull request: https://github.com/ersilia-os/ersilia/pull/654

GemmaTuron commented 1 year ago

Hi @carcablop !

This is fantastic thanks for solving that

GemmaTuron commented 1 year ago

I will close this issue!