Closed miquelduranfrigola closed 1 year ago
Hello @miquelduranfrigola. Thank you very much for this information, it is really valuable for me. I'll talk you by slack to organize the meet.
Hi @GemmaTuron and @miquelduranfrigola
A possible solution that was proposed was the following: Considering JSONL format, it is intended to write all the output of the JSON file in a single column. To make sure that this implementation won't fail because the outputs can be variable with respect to the length of the tree in the JSON. It is proposed to implement a function that can read the model output from the airtable, and if that output is JSON we can implement the solution with the JSONL format. For this, the following must be taken into account:
Predefine the types of outputs, to create a column that can be called output_shape, where the type of output for each model is defined. Which can be:
Implement the function to read from the airtable the output (the previously created column "output_shape") given a model.
Modify the function to_data_frame() : Depending on the output of the output_shape, if it is JSON, the JSONL format functionality is executed, and for other types of outputs it is executed as it was done before.
Hello @GemmaTuron and @miquelduranfrigola.
As stated above, I've been working on this problem using the JSONL format to serialize the JSON output to a csv. I was working on the implementation of the function that allows reading the output of the "output shape" field from the airtable, if it is an output shape equal to "Flexible list" (which is the case of the eos526j and eos5qfo models, with JSON output type ), we serialize the output to a string, and return the entire JSON string in a single column of the csv file. I have tested the eos526j model in my Cli, with a .csv type input file with two molecules and specifying the .csv file output. It runs without errors. I attach the output: ouput_eos526j.csv I'm still testing the functionality by passing the entire "eml_canonical.csv" file as input, but this takes a long time. (It's still running.)
I also ran the eos5qfo model, which also presented issues, this model runs on my machine successfully with the changes and I have also been able to have the output with the .csv extension successfully passing the eml_canonical as input. I attach the output for this model: output_eos5qfo_1.csv
As you can see in the output file, let's return all the JSon in a single column named "outcome".
I only have to make sure that the models that have tested for me successfully continue to test successfully in my CLI with the changes before opening a pull request.
Hi @miquelduranfrigola After trying to make the predictions for a few hours, the model throws the following error, indicating that the connection has been lost. log_predict_eos526j.txt
Hi @carcablop
This error is due to internet issues, not related to the Hub. have you tried again?
Hi @carcablop
Following up on this to make sure we close the issue during this week
Hi @GemmaTuron This model usually takes a long time to run, and possibly it was an internet connection problem, I'm going to run it again.
Hi @carcablop! Any updates?
Hello @GemmaTuron and @miquelduranfrigola. Sorry for my late reply, to test the eos526j model it is required to have a very stable internet connection for a long period. Additionally, I wanted to make sure that it wasn't going to cause errors. Here are my updates:
The model successfully fetch, serve and predict passing the entire eml_canonical file as input and printing the output in a .csv file. It takes about 6 hours to predict all entries in the eml_canonical.csv file. Log fetch: log_fetch_eos526j.txt
This does not generate errors, you can see in the predictions log that no error was generated, although checking the files that are generated when it is in the process of making the predictions (I have been able to see it in the tmp file on my system), I can see that an error is generated here:
eos526j2023-03-22 181959.746412 E tensorfl.txt
Regarding this, what I was able to investigate is that it may be because an invalid value is set for this CUDA_VISIBLE_DEVICES
environment variable so that it does not use the GPU, and this is what may be generating that error. The solution is to change it to CUDA_VISIBLE_DEVICES=0
. In the model's main code, I can see that it's set to CUDA_VISIBLE_DEVICE=-1
here: https://github.com/ersilia-os/eos526j/blob/main/model/framework/code/main.py#L8
we can change it to avoid this error.
Otherwise, the model runs successfully in my CLI with the changes and the output can be generated in a .csv file
Here I attach the output of the model:
I was able to fetch the model, serve and predict by passing it the entire em_canonical and printing the output to a .csv file. Particularly with this model, I found that when fetching errors like: "input is inconsistent and output is inconsistent", even though the fetch succeeded on my CLI. Here I attach the model fetch log:
Output .csv salida_eos5qfo.csv
Although the model makes the predictions and prints output in both .json and .csv formats, when it tries to make the predictions you can see errors being generated in the process. An error is thrown with TensorFlow when trying to load the pre-trained model raising an InvalidArgumentError
.
I attach a log of the errors generated in my temporary files when it makes the predictions:
run.log
I still don't quite understand the error being generated, and if this affects the results in the model's predictions. I tried to download the model and execute it only in the main.py but this generates other errors. If I also try to run this in google colab it also generates the same errors. Passing a single molecule as input, the model does not generate an output, it generates the following error:
assertionError_eos5qfo_1mol.txt
Other models with the output shape equal to Flexible list
, such as eos4q1a, and this model runs successfully and predicts without errors, both for one molecule and for all the eml_canonical files.
Other models with another Output shape can be successfully tested and do not generate errors.
For models that are still in progress and that the fields in the output shape airtable are empty, I added a try-except to the function to avoid the error when this field is empty and can be executed as it has been implemented and tested without errors.
The proposed implementation for the output adapter was well done. I tried to do several tests to make sure that this does not generate errors. Let me know if another test needs to be done.
Here are the changes: https://github.com/carcablop/ersilia/commit/05db137224290f370396432c2594f2f6d4f5c3d3
Thanks, I'll do the pull request.
This is the pull request: https://github.com/ersilia-os/ersilia/pull/654
Hi @carcablop !
This is fantastic thanks for solving that
I will close this issue!
Hello @carcablop !
Based on previous discussion, it will be good to serialize JSON outputs to CSV. @DhanshreeA has already investigated the issue a little bit. I have assigned it to you because you have expressed interest in it: #371
A while ago, I wrote some code that can help:
In particular, in (2) there is a
DictlistDataframeConverter
that can help. In (3), there is a very simple CSV dataloader that we could extend. Finally, in (4) we could consider writting something likedef _is_csv_serializable(...)
to check whether a given output could be expressed in a CSV (the majority of outputs will be thanks to the JSONL format as discussed)I would suggest that we organize a 1-1 meeting to talk about this and then we reflect the strategy in this issue thread.