Clean UP & Dockerization eos1vms

simrantan commented 1 year ago

@GemmaTuron I have been working on the error reported for this model all day. There were some oddities in the code (checkpoints was being accessed through sys.argv, and the code has a hard-coded example molecule) that were causing functionality discrepancies. Once I found these bugs, I managed to fix them while maintaining functionality. I then started working on why the model returns ChEMBL IDs when fetched but not when run. It appears that main.py does not write any output involving the ChEMBL IDs; it only processes the prediction scores for these targets and returns them. I am not sure why it was returning ChEMBL IDs when fetched, though I spent some time looking into how Ersilia fetches models to see what might cause this discrepancy. I have started editing main.py to return the main target (as the model description says) instead of all prediction scores. I have been running into numerous bugs as I do so, but I am still working through them. I am getting close to a solution that will hopefully return the correct output, and I will post an update once it works.

GemmaTuron commented 1 year ago

Hi @simrantan Good job! If possible, we should return the ChEMBL target + the prediction score. Indeed, main.py is where the model outputs are processed, so that is where the changes should be made.

simrantan commented 1 year ago

Hi @GemmaTuron I have fixed the issue that was occurring while testing (KeyErrors no longer occur), but I am still working on getting the output to be the ChEMBL target and the prediction score. I have tried multiple methods, but for some reason the ChEMBL targets never show up in the output. I tried changing the output from X to targets, which usually results in an error, and I changed the mode for writing the output file from "w" to "a" to avoid overwriting, among other things. In my latest attempt, I switched from using csv to using pandas to see if it would work better:

data = pd.DataFrame(X, columns=desc.targets)

# Open the output file in append mode
with open(output_file, "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(desc.targets)  # Write the header

    # Append the DataFrame data to the output file
    data.to_csv(f, header=False, index=False)

However, the output still doesn't include any targets. I believe this is the source of the discrepancy between fetch showing the target names and run not showing them: for whatever reason, when fetching, the line writer.writerow(desc.targets) works just fine to produce the "meta", but when running, it appears to be skipped entirely and produces no output.

Another possible issue is the output shape. Right now, it is a Flexible List, and the output type is String. However, if we are returning the target and the prediction score, is the output type a dictionary? Or is the shape a dictionary? Could this cause discrepancies in how the output file is written?
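For reference, a minimal sketch of writing the header exactly once, using only the standard csv module. The names `write_predictions`, `targets`, and `scores` are hypothetical stand-ins for whatever main.py uses (for example desc.targets and X); this is not the model's actual code.

```python
import csv
import io


def write_predictions(handle, targets, scores):
    """Write one header row of target IDs, then one row of scores per molecule."""
    writer = csv.writer(handle)
    writer.writerow(targets)   # header, written exactly once
    writer.writerows(scores)   # one row per input molecule


# Hypothetical data standing in for desc.targets and X.
targets = ["CHEMBL1234", "CHEMBL5678"]
scores = [[0.91, 0.05], [0.12, 0.77]]

# An in-memory buffer keeps the example self-contained; with a real file,
# open(output_file, "w", newline="") plays the same role.
buffer = io.StringIO(newline="")
write_predictions(buffer, targets, scores)
print(buffer.getvalue())
```

Opening the real file in "w" mode rather than "a" means each run starts from an empty file, so the header cannot accumulate across runs.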

GemmaTuron commented 1 year ago

Hi @simrantan

Could you print desc.targets to understand what they are? And also the data? If running the bash command from run.sh (which simply calls main.py) works, the parsing issue will be in service.py. If I recall correctly, at fetch time it does work and gives the outputs. It's not clear to me whether main.py gives the outputs when using the bash command; if it does not, please add more print statements to follow what is going on step by step.

simrantan commented 1 year ago

Hi,

Thank you for the tips! Yes, I have printed desc.targets before (it is a list of the ChEMBL names like CHEMBL1234, etc.). I will add print statements to keep checking what is happening in the code. When using ersilia run after fetching, this is the output: output1vs.csv So an output is being created without KeyErrors now. I will try running with bash to see if the issue is in service.py!

simrantan commented 1 year ago

Thank you for the advice to use bash run.sh! I tested this and found that main.py was producing the correct output. I spent some time looking at service.py and testing different changes to see where in the file the issue was happening. I knew it was in the "run" function, and I think what is occurring is that this code:

        with open(pred_file, "r") as f:
            reader = csv.reader(f)
            h = next(reader)
            R = []
            for r in reader:
                R += [{"scores": [float(x) for x in r]}]
        output = {
            'result': R,
            'meta': {'scores': h}
        }
        return output

does not return the headers: it skips over them and places them in a "meta" dictionary that is, for some reason, cut off from the output when using "run". It is unclear from this code why "meta" is not part of any output file. While experimenting, I found that changing R += [{"scores": [float(x) for x in r]}] to R += [{"header": h, "scores": [float(x) for x in r]}] actually made a difference in the output file, printing out the headers (but no scores this time, so still a failure). I am still trying methods to fix it, but I keep running into malfunctions any time I change the contents of the output dictionary. This is the error I get with any change to output:

10:04:33 | DEBUG    | Waiting for server
10:04:34 | DEBUG    | Trying to wake up. Iteration: 1
10:04:34 | DEBUG    | Timeout: 1000 Sleep time: 1
10:04:34 | DEBUG    | Temporary file available: /tmp/ersilia-ktzmeqq7/serve.log
10:04:34 | DEBUG    | No error strings found in temporary file
10:04:34 | DEBUG    | Server logging done
10:04:35 | DEBUG    | Trying to wake up. Iteration: 2
10:04:35 | DEBUG    | Timeout: 1000 Sleep time: 1
10:04:35 | DEBUG    | Temporary file available: /tmp/ersilia-ktzmeqq7/serve.log
10:04:35 | DEBUG    | No error strings found in temporary file
10:04:35 | DEBUG    | Server is ready. Trying to get URL
10:04:35 | DEBUG    | URL found: http://127.0.0.1:34891
10:04:35 | DEBUG    | Iterating over APIs
10:04:35 | DEBUG    | Running API: run
10:04:35 | DEBUG    | ['CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O', 'C1=CN=CC=C1C(=O)NN']
10:04:35 | DEBUG    | API: run
10:04:35 | DEBUG    | MODEL ID: eos1vms
10:04:35 | DEBUG    | SERVICE URL: http://127.0.0.1:34891
10:04:35 | DEBUG    | Reading card from eos1vms
10:04:35 | DEBUG    | Reading shape from eos1vms
10:04:35 | DEBUG    | Input Shape: Single
10:04:35 | DEBUG    | Input type is: compound
10:04:35 | DEBUG    | Input shape is: Single
10:04:35 | DEBUG    | Importing module: .types.compound
10:04:35 | DEBUG    | Checking RDKIT and other requirements necessary for compound inputs
10:04:35 | DEBUG    | InputShapeSingle shape: Single
10:04:36 | DEBUG    | API eos1vms:run initialized at URL http://127.0.0.1:34891
10:04:36 | DEBUG    | Schema not yet available
10:04:36 | INFO     | No empty output available
10:04:36 | DEBUG    | Meta: None
10:04:36 | DEBUG    | Posting to run
10:04:36 | DEBUG    | Batch size 100
10:04:36 | DEBUG    | Schema not yet available
10:04:39 | DEBUG    | Status code: 200
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

0

Sometimes it gives error message 1 instead. Changing R also sometimes results in these errors, and the error logs do not provide any detail beyond these numbers. I have been checking this service.py against the template to understand this behavior, but I have yet to find an explanation for why changing output is so critical. I will keep trying different methods to include the headers.
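One way to narrow this down is to run the parsing step on its own, outside the server. The sketch below mirrors the service.py excerpt above as a standalone function; `parse_predictions` and the sample data are hypothetical, and how Ersilia consumes the "meta" key downstream is not shown in this thread.

```python
import csv
import io


def parse_predictions(handle):
    """Split a prediction CSV into a header list and one score list per row."""
    reader = csv.reader(handle)
    header = next(reader)  # first row: target IDs
    rows = [{"scores": [float(x) for x in row]} for row in reader]
    return {"result": rows, "meta": {"scores": header}}


# Hypothetical file content standing in for pred_file.
sample = io.StringIO("CHEMBL1234,CHEMBL5678\n0.91,0.05\n0.12,0.77\n")
output = parse_predictions(sample)
print(output["meta"]["scores"])
print(len(output["result"]))
```

If the header shows up under "meta" here, the parsing itself is fine and the loss most likely happens later, when the returned dictionary is turned into the output file.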

simrantan commented 1 year ago

@GemmaTuron I have got it to work!! The output is now the targets and the outcomes. The only issue is now a formatting one - for whatever reason, all the strings are double-quoted. Ex:

key,input,outcome
LUHMMHZLDLBAKX-UHFFFAOYSA-N,CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O,"{""header"": [""CHEMBL1075104"", ""CHEMBL1075110"", ""CHEMBL1075126"", ""CHEMBL1075138"", ""CHEMBL1075145"", ""CHEMBL1075189"", ""CHEMBL1075232"", ""CHEMBL1075317"", ""CHEMBL1163101"", ""CHEMBL1163125"", ""CHEMBL1250348"", ""CHEMBL1255150"", ""CHEMBL1287622"", ""CHEMBL1293224"", ""CHEMBL1293226"", ""CHEMBL1293231"", ""CHEMBL1293232"", ""CHEMBL1293237"", ""CHEMBL1293255"", ""CHEMBL1293277"", ""CHEMBL1293289"", ""CHEMBL1293299"", ""CHEMBL1741186"", ""CHEMBL1744525"", ""CHEMBL1781"", ""CHEMBL1782"", ""CHEMBL1784"", ""CHEMBL1785"", ""CHEMBL1790"", ""CHEMBL1792"", ""CHEMBL1795086"",

this is the code I wrote:

        with open(pred_file, "r") as f:
            reader = csv.reader(f)
            h = next(reader)
            R = []
            for r in reader:
                entry = {"targets": h, "scores": [float(x) for x in r]}
                R.append(entry)
        output = {
            'result': R,
            'meta': {'scores': h}
        }
        shutil.rmtree(tmp_folder)
        return output

I'm not sure why the double quotes are there, but I am working on finding a solution if possible! Research right now shows that using json.dump is the safest method to fix a problem like this, but I am not sure if that is the ideal use in this case so I'm looking into alternatives that might work.
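A possible explanation, assuming the nested dictionary is serialized to a JSON string and stored in a single CSV cell: standard CSV quoting wraps any field containing quotes or commas in double quotes and doubles each embedded quote. The sketch below reproduces the effect with hypothetical values and shows that a CSV reader undoes it.

```python
import csv
import io
import json

entry = {"header": ["CHEMBL1234"], "scores": [0.91]}  # hypothetical values

# Write the JSON string as one cell of a CSV row.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(["key", json.dumps(entry)])
line = buffer.getvalue()
print(line)

# Reading the row back with csv.reader undoes the doubling.
restored = next(csv.reader(io.StringIO(line)))
print(json.loads(restored[1]) == entry)
```

Under that assumption the doubled quotes are valid CSV rather than corruption; they go away only if each value gets its own column instead of being packed into one cell.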

GemmaTuron commented 1 year ago

Hi @simrantan

Please can you attach the output you get running run.sh? Otherwise I can't help in the function that processes it.

simrantan commented 1 year ago

Hi, yes -

this is the output using bash: output1vms.csv

and this is the output using ersilia run: outputtest.csv

GemmaTuron commented 1 year ago

Hi Simran,

If you look at the files, you'll see that, first, the output is duplicated. Second, does it always output all these targets and their probabilities, or do the targets change? I am not sure about this. The answer will tell us whether the output is a List or a Flexible List. Once this is clear, you need to process this file in service.py so that the headers are kept as the ChEMBL targets and the rows indicate the probabilities. For that it would be better if we fixed the output to List, so can you try different sets of molecules and see how many results you get per run?
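A quick way to check this is to count the columns in every row of the CSV produced by run.sh for several input sets. This is only a sketch; `output_lengths` and the sample content are hypothetical.

```python
import csv
import io


def output_lengths(handle):
    """Return the set of column counts seen across all rows, header included."""
    return {len(row) for row in csv.reader(handle) if row}


# Hypothetical output standing in for the CSV produced by run.sh.
sample = io.StringIO("CHEMBL1234,CHEMBL5678\n0.91,0.05\n0.12,0.77\n")
lengths = output_lengths(sample)
print(lengths)
```

A single value in the set, identical across different molecule sets, would suggest a fixed-length output; several values would point to a variable-length one.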

simrantan commented 1 year ago

Hi,

Yes, I think I may have misunderstood what was meant by adding ChEMBL targets + predictions. The original code printed the predictions for all targets, so what I added was the corresponding targets as the header. However, I can change the functionality to return only the main targets and their scores! I will change the output to List to see what that does, and add the functionality so it only returns the main targets and their scores.

simrantan commented 1 year ago

@GemmaTuron I have fixed the functionality so that it now returns the main target and the corresponding prediction score for each SMILES string. This is the output of service.py now: outputcorrected1.csv

I am still getting the double quotes issue but the output is now correct.
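As an illustration only, and assuming "main target" means the target with the highest prediction score (the thread does not spell out the selection rule), the selection could look like this. `main_target` and the sample data are hypothetical.

```python
def main_target(targets, scores):
    """Return the (target, score) pair with the highest score for one molecule."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return targets[best], scores[best]


# Hypothetical target IDs and per-molecule score rows.
targets = ["CHEMBL1234", "CHEMBL5678"]
rows = [[0.91, 0.05], [0.12, 0.77]]
results = [main_target(targets, row) for row in rows]
print(results)
```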

GemmaTuron commented 1 year ago

perfect! Let's see if you can get rid of the double quotes, this should be a small issue in the processing of the output

simrantan commented 1 year ago

@GemmaTuron

I went back to returning the ChEMBL targets, and based on the bash run results it is only running once: output1vmsBash.csv So I think the issue we were looking at earlier was actually a processing issue, which I am working on.

simrantan commented 1 year ago

@GemmaTuron I have found the reason why the targets repeated: the only way I got the headers to show up was by including them in this loop:

            for r in reader:
                entry = {"outcome": [float(x) for x in r]}
                result["result"].append(entry)

which wrote the headers once per row instead of once for the whole file. However, I have spent all day trying different methods to get the header to work, and they keep failing. I am getting "Error: 0" a lot. And when the code does work, the output is always missing the header. No matter what method I tried, this was the result.

To see why it wasn't working, I created a Python script to run the part of the code responsible for the output of service.py:


        with open(pred_file, "r") as f:
            reader = csv.reader(f)
            h = next(reader)
            result = {"meta": {"outcome": h}, "result": []}
            #R = []
            for r in reader:
                entry = {"outcome": [float(x) for x in r]}
                result["result"].append(entry)
                #R.append(entry)

        shutil.rmtree(tmp_folder)
        return result

and printed "result" as well as [result] (since service.py returns [output] at the end of the file, like this):

@artifacts([Artifact("model")])
class Service(BentoService):
    @api(input=JsonInput(), batch=True)
    def run(self, input: List[JsonSerializable]):
        input = input[0]
        smiles_list = [inp["input"] for inp in input]
        output = self.artifacts.model.run(smiles_list)
        return [output]

and found that the targets ARE being printed, followed by the predictions. The output is exactly correct, so why is it getting truncated in service.py? This code is correct: it populates the result with a single instance of the targets list as the header, followed by the prediction scores. I am now very confused about the cause. If this code is correct, where else could the header be getting cut off when I use "ersilia run"?

GemmaTuron commented 1 year ago

Hi @simrantan

I'll need to play with the code to figure this out, but I can't do it today. Please open a PR with the updated code even if it is not fully working so I can merge it and work on this from the Ersilia side next week! Sorry I can't provide more support today.

simrantan commented 1 year ago

I will, thank you so much! I really appreciate the support

GemmaTuron commented 1 year ago

Hi @simrantan

I've worked on this model. Some comments:

The hard-coded molecule is necessary to get the target list.
The format of passing the checkpoints as sys.argv is an older format of the eos-template; I've refactored the paths.

Aside from that, I haven't made many more changes, just a few in main.py. The model works fine and returns the expected probabilities. You can test it and let me know.

simrantan commented 1 year ago

Hi, yes, I ended up leaving the hard-coded molecule in as well. Thank you so much for all the support! I tested the model and it is giving me this error (this is the whole error log):

(ersilia) simran@DESKTOP-BKSEGAO:~$ ersilia run -i eos/eml_canonical.csv -o outputeos1vms.csv
Traceback (most recent call last):
  File "/home/simran/miniconda3/envs/ersilia/bin/ersilia", line 33, in <module>
    sys.exit(load_entry_point('ersilia', 'console_scripts', 'ersilia')())
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 138, in wrapper
    return func(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 115, in wrapper
    return_value = func(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 99, in wrapper
    return func(*args, **kwargs)
  File "/home/simran/ersilia/ersilia/cli/commands/run.py", line 34, in run
    result = mdl.run(input=input, output=output, batch_size=batch_size)
  File "/home/simran/ersilia/ersilia/core/model.py", line 143, in _method
    return self.api(api_name, input, output, batch_size)
  File "/home/simran/ersilia/ersilia/core/model.py", line 353, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/simran/ersilia/ersilia/core/model.py", line 367, in api_task
    for r in result:
  File "/home/simran/ersilia/ersilia/core/model.py", line 194, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
  File "/home/simran/ersilia/ersilia/serve/api.py", line 320, in post
    input=unique_input, output=None, batch_size=batch_size
  File "/home/simran/ersilia/ersilia/serve/api.py", line 296, in post_unique_input
    for res in self.post_amenable_to_h5(input, output, batch_size):
  File "/home/simran/ersilia/ersilia/serve/api.py", line 249, in post_amenable_to_h5
    input=todo_input, output=todo_output, batch_size=batch_size
  File "/home/simran/ersilia/ersilia/serve/api.py", line 130, in post_only_calculations
    self._post(input, subfile)
  File "/home/simran/ersilia/ersilia/serve/api.py", line 95, in _post
    result = self._do_post(input, output)
  File "/home/simran/ersilia/ersilia/serve/api.py", line 84, in _do_post
    result_ = self.output_adapter.refactor_response(result_)
  File "/home/simran/ersilia/ersilia/io/output.py", line 122, in refactor_response
    m = self._nullify_meta(m, r)
  File "/home/simran/ersilia/ersilia/io/output.py", line 108, in _nullify_meta
    one_output = random.choice(result)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/random.py", line 261, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence

I am not sure why this is. I also got this error when I ran it again:

(ersilia) simran@DESKTOP-BKSEGAO:~$ ersilia run -i eos/eml_canonical.csv -o outputeos1vms.csv
Traceback (most recent call last):
  File "/home/simran/miniconda3/envs/ersilia/bin/ersilia", line 33, in <module>
    sys.exit(load_entry_point('ersilia', 'console_scripts', 'ersilia')())
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 138, in wrapper
    return func(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 115, in wrapper
    return_value = func(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 99, in wrapper
    return func(*args, **kwargs)
  File "/home/simran/ersilia/ersilia/cli/commands/run.py", line 34, in run
    result = mdl.run(input=input, output=output, batch_size=batch_size)
  File "/home/simran/ersilia/ersilia/core/model.py", line 143, in _method
    return self.api(api_name, input, output, batch_size)
  File "/home/simran/ersilia/ersilia/core/model.py", line 353, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/simran/ersilia/ersilia/core/model.py", line 367, in api_task
    for r in result:
  File "/home/simran/ersilia/ersilia/core/model.py", line 194, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
  File "/home/simran/ersilia/ersilia/serve/api.py", line 320, in post
    input=unique_input, output=None, batch_size=batch_size
  File "/home/simran/ersilia/ersilia/serve/api.py", line 296, in post_unique_input
    for res in self.post_amenable_to_h5(input, output, batch_size):
  File "/home/simran/ersilia/ersilia/serve/api.py", line 262, in post_amenable_to_h5
    done_input, todo_input, done_output, todo_output
  File "/home/simran/ersilia/ersilia/serve/api.py", line 208, in _process_done_todo_results
    yield todo_output_data[i]
IndexError: list index out of range

I am not sure whether I am running the command-line argument correctly. It is the same format as the one I ran for eos8fth, but this time it seems to be causing issues. Let me know if there is anything more I should do to debug this.

GemmaTuron commented 1 year ago

Hi @simrantan !

Could you run it in -v mode? Also, did you pull the latest Ersilia code before running the models?

simrantan commented 1 year ago

Hi,

I did have the latest ersilia code! I re-downloaded ersilia again just to make sure, and re-ran the model with -v, and got the same error: errorlog1vms.txt

GemmaTuron commented 1 year ago

Hi @simrantan

From the above error I see this message:

> Detailed error:
> Input file eml_canonical.csv does not exist

It seems you are not passing the right path to the input file.
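A small pre-flight check along these lines can catch a wrong path before the model is run. This is only a sketch; `resolve_input` is a hypothetical helper and not part of Ersilia.

```python
import os
import tempfile


def resolve_input(path):
    """Return the absolute path of an input file, or raise if it is missing."""
    full = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full):
        raise FileNotFoundError(f"Input file {full} does not exist")
    return full


# Demonstration with a throwaway folder: one file exists, one does not.
with tempfile.TemporaryDirectory() as folder:
    existing = os.path.join(folder, "molecules.csv")
    with open(existing, "w") as f:
        f.write("smiles\nCCO\n")
    found = resolve_input(existing)
    try:
        resolve_input(os.path.join(folder, "missing.csv"))
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(found, missing_detected)
```

Printing the resolved absolute path also shows which directory a relative path such as eos/eml_canonical.csv is being resolved against.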

simrantan commented 1 year ago

My bad!

I ran it again with the right file and am still getting the error:

(ersilia) simran@DESKTOP-BKSEGAO:~$ ersilia run -i eos/eml_canonical.csv -o output1vms.csv
/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Traceback (most recent call last):
  File "/home/simran/miniconda3/envs/ersilia/bin/ersilia", line 33, in <module>
    sys.exit(load_entry_point('ersilia', 'console_scripts', 'ersilia')())
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simran/miniconda3/envs/ersilia/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/simran/ersilia/ersilia/cli/commands/__init__.py", line 22, in wrapper
    return func(*args, **kwargs)
  File "/home/simran/ersilia/ersilia/cli/commands/run.py", line 34, in run
    result = mdl.run(input=input, output=output, batch_size=batch_size)
  File "/home/simran/ersilia/ersilia/core/model.py", line 144, in _method
    return self.api(api_name, input, output, batch_size)
  File "/home/simran/ersilia/ersilia/core/model.py", line 354, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/simran/ersilia/ersilia/core/model.py", line 368, in api_task
    for r in result:
  File "/home/simran/ersilia/ersilia/core/model.py", line 195, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
  File "/home/simran/ersilia/ersilia/serve/api.py", line 320, in post
    input=unique_input, output=None, batch_size=batch_size
  File "/home/simran/ersilia/ersilia/serve/api.py", line 296, in post_unique_input
    for res in self.post_amenable_to_h5(input, output, batch_size):
  File "/home/simran/ersilia/ersilia/serve/api.py", line 262, in post_amenable_to_h5
    done_input, todo_input, done_output, todo_output
  File "/home/simran/ersilia/ersilia/serve/api.py", line 208, in _process_done_todo_results
    yield todo_output_data[i]
IndexError: list index out of range

This is the -v version: eos1vmserror2.txt

It seems the error occurs at the "returning and rearranging" step so it may be from service.py

GemmaTuron commented 1 year ago

Hi @simrantan I don't know why you are getting this. I see you are fetching from repo_path; are you sure you have pulled the latest version of both the model and Ersilia? I have tested again and it works fine for me, fetching both from Docker and from GitHub: eml_out.csv

ersilia-os / eos1vms

Clean UP & Dockerization eos1vms #1