airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Initial moves to OpenAPI v3 #767

Open schristley opened 4 months ago

schristley commented 4 months ago

Start building the base so we only need to edit the OpenAPI v3 spec, and the v2 spec can be generated from v3. Also clean up the test suites so there is one common set of test data files used by all languages.

Restarted as I messed up the master merge with the old PR #758

schristley commented 4 months ago
$ make

Helper commands for AIRR Standards repository

make gen-v2       -- Generate OpenAPI V2 spec from the V3 spec
make build-docs   -- Build documentation
make spec-copy    -- Copy spec files to language directories
make data-copy    -- Copy test data files to language directories
make checks       -- Run consistency checks on spec files
make tests        -- Run all language test suites
make python-tests -- Run Python test suite
make r-tests      -- Run R test suite
make js-tests     -- Run Javascript test suite
schristley commented 4 months ago

@javh @bussec Are we allowing NA to be in the rearrangement TSV? I'm reconciling the test data and for the bad_rearrangement.tsv file, the R version has an NA while the python version does not. If I try to use the R version with the NA then python crashes:

Traceback (most recent call last):
  File "/work/lang/python/tests/test_interface.py", line 59, in test_load_rearrangement
    result = airr.load_rearrangement(self.rearrangement_bad)
  File "/work/lang/python/airr/interface.py", line 103, in load_rearrangement
    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1036, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1075, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 4

it seems even if we don't accept NA, this is maybe a python bug?

schristley commented 4 months ago

Catching the exception prevents the crash, but I'm not sure if that is what we want, or if we want pandas to allow NA and transform to None.

diff --git a/lang/python/airr/interface.py b/lang/python/airr/interface.py
index 590c3a7..07b194b 100644
--- a/lang/python/airr/interface.py
+++ b/lang/python/airr/interface.py
@@ -100,10 +100,15 @@ def load_rearrangement(filename, validate=False, debug=False):
     # TODO: test pandas.DataFrame.read_csv with converters argument as an alterative
     schema = RearrangementSchema

-    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
-                     dtype=schema.pandas_types(), true_values=schema.true_values,
-                     false_values=schema.false_values)
-    # added to use RearrangementReader without modifying it:
+    try:
+        df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
+                         dtype=schema.pandas_types(), true_values=schema.true_values,
+                         false_values=schema.false_values)
+        # added to use RearrangementReader without modifying it:
+    except Exception as e:
+        sys.stderr.write('Error occurred while loading AIRR rearrangement file: %s\n' % e)
+        return None
+
     buffer = StringIO()  # create an empty buffer
     df.to_csv(buffer, sep='\t', index=False)  # fill buffer
     buffer.seek(0)  # set to the start of the stream
javh commented 4 months ago

The R library will accept "" (empty string), NA, or None for null values. Though, the spec officially only recognizes an empty string as a null value.

I think allowing NA to equate to None in python would be fine (though NA is valid amino acid sequence), but I think it's less of a python bug and more of an invalid bad_rearrangement.tsv file... It is supposed to be "bad", I guess. But, how bad?

schristley commented 4 months ago

It is supposed to be "bad", I guess. But, how bad?

bad, but no so bad it causes a crash!

It isn't so much about what to test in the "bad" file. I'm more worried that in a "good" file there is an NA and R accepts it, but python doesn't, and/or we get incompatibility where an R output file cannot be fed into python because it has NAs.

bcorrie commented 4 months ago

The R library will accept "" (empty string), NA, or None for null values. Though, the spec officially only recognizes an empty string as a null value.

I would suggest that on input from an AIRR file, NA/None should not be interpreted as null and this should be rejected. It is non compliant if the data has NA/None for null, no?

Also on output, it should never output NA/None for null, it should always output an empty string.

javh commented 4 months ago

Yeah, it's true that the files are non-compliant if they include NA/None. And the R and python libraries do output empty string for NA/None values.

But, NA/None tend to be the default outputs from TSV writers outside the airr reference libraries. So, it's a compromise to deal with typical TSV output.

schristley commented 3 months ago

Sounds like there are two things here. 1) change the test so it works for both R and python. That should be easy then. 2) python and R need more support for null-like values, for cross-language interoperability of AIRR TSV. That should probably be it's own issue, as it's new code to write, with the task to write additional tests to handle the null-like values.