Smiles importer test for improvements

FelixBaensch / MORTAR

MOlecule fRagmenTAtion fRamework

MIT License

18 stars 3 forks source link

Smiles importer test for improvements #31

Closed JonasSchaub closed 5 months ago

JonasSchaub commented 7 months ago

@SamuelBehr what was the purpose of this branch / these changes? Are these changes worth keeping/merging? Do you still need to do sth here?

FelixBaensch commented 6 months ago

Please relax the restriction, why should a file be limited to 2 or 3 entries per line. And check more (maybe 10) lines to identify the delimiter.

JonasSchaub commented 6 months ago

@FelixBaensch

Please relax the restriction, why should a file be limited to 2 or 3 entries per line.

The file can have as many columns as it wants, but only the first 2 (or 3) are checked for valid SMILES strings. The remaining columns are ignored. What do you think about this?

And check more (maybe 10) lines to identify the delimiter.

Do you think a SMILES file can have such an extensive header? Or am I missing the point?

SamuelBehr commented 6 months ago

Raising the number of lines to be checked when identifying the delimiter would also statistically increase the risk of falsely identifying a file as SMILES file or of identifying the false delimiter. It would also increase the time consumption the delimiter identification could take in a worst case scenario. So I would not raise this number to infinity ...

I chose the number of 3 lines to cover the case of potentially having a headline followed up by a blank line. Increasing the number further to a count of maybe 5 would also take the risk of having lines with unparseable SMILES codes into account.

FelixBaensch commented 6 months ago

The file can have as many columns as it wants, but only the first 2 (or 3) are checked for valid SMILES strings. The remaining columns are ignored. What do you think about this?

Sounds good

Do you think a SMILES file can have such an extensive header? Or am I missing the point?

I am not talking about a header, but about a large file with many SMILES and an undefined number of lines that do not contain any parsable SMILES. However, the valid SMILES should still be imported. I think testing only the first 3 of 500 lines, i.e. not even 1%, is too small. What is the problem with testing the first 10? In terms of time, this should have no influence.

I chose the number of 3 lines to cover the case of potentially having a headline followed up by a blank line. Increasing the number further to a count of maybe 5 would also take the risk of having lines with unparseable SMILES codes into account.

As mentioned above I don't see any problems with non-parsable SMILES.

JonasSchaub commented 5 months ago

Still to do:

look at / fix SonarCloud issues
think about how to report faulty molecules that could not be parsed from the input file.

JonasSchaub commented 5 months ago

@FelixBaensch and @SamuelBehr , if you have time and motivation, you are welcome to re-review these changes. Otherwise, please simply approve, thank you.

sonarcloud[bot] commented 5 months ago

Quality Gate passed

Issues
0 New issues
16 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud