Open schristley opened 4 months ago
$ make
Helper commands for AIRR Standards repository
make gen-v2 -- Generate OpenAPI V2 spec from the V3 spec
make build-docs -- Build documentation
make spec-copy -- Copy spec files to language directories
make data-copy -- Copy test data files to language directories
make checks -- Run consistency checks on spec files
make tests -- Run all language test suites
make python-tests -- Run Python test suite
make r-tests -- Run R test suite
make js-tests -- Run Javascript test suite
@javh @bussec Are we allowing NA
to be in the rearrangement TSV? I'm reconciling the test data and for the bad_rearrangement.tsv
file, the R version has an NA while the python version does not. If I try to use the R version with the NA then python crashes:
Traceback (most recent call last):
File "/work/lang/python/tests/test_interface.py", line 59, in test_load_rearrangement
result = airr.load_rearrangement(self.rearrangement_bad)
File "/work/lang/python/airr/interface.py", line 103, in load_rearrangement
df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1036, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1075, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 4
it seems even if we don't accept NA, this is maybe a python bug?
Catching the exception prevents the crash, but I'm not sure if that is what we want, or if we want pandas to allow NA and transform to None.
diff --git a/lang/python/airr/interface.py b/lang/python/airr/interface.py
index 590c3a7..07b194b 100644
--- a/lang/python/airr/interface.py
+++ b/lang/python/airr/interface.py
@@ -100,10 +100,15 @@ def load_rearrangement(filename, validate=False, debug=False):
# TODO: test pandas.DataFrame.read_csv with converters argument as an alterative
schema = RearrangementSchema
- df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
- dtype=schema.pandas_types(), true_values=schema.true_values,
- false_values=schema.false_values)
- # added to use RearrangementReader without modifying it:
+ try:
+ df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
+ dtype=schema.pandas_types(), true_values=schema.true_values,
+ false_values=schema.false_values)
+ # added to use RearrangementReader without modifying it:
+ except Exception as e:
+ sys.stderr.write('Error occurred while loading AIRR rearrangement file: %s\n' % e)
+ return None
+
buffer = StringIO() # create an empty buffer
df.to_csv(buffer, sep='\t', index=False) # fill buffer
buffer.seek(0) # set to the start of the stream
The R library will accept ""
(empty string), NA
, or None
for null values. Though, the spec officially only recognizes an empty string as a null value.
I think allowing NA
to equate to None
in python would be fine (though NA
is valid amino acid sequence), but I think it's less of a python bug and more of an invalid bad_rearrangement.tsv
file... It is supposed to be "bad", I guess. But, how bad?
It is supposed to be "bad", I guess. But, how bad?
bad, but no so bad it causes a crash!
It isn't so much about what to test in the "bad" file. I'm more worried that in a "good" file there is an NA
and R accepts it, but python doesn't, and/or we get incompatibility where an R output file cannot be fed into python because it has NA
s.
The R library will accept
""
(empty string),NA
, orNone
for null values. Though, the spec officially only recognizes an empty string as a null value.
I would suggest that on input from an AIRR file, NA/None should not be interpreted as null and this should be rejected. It is non compliant if the data has NA/None for null, no?
Also on output, it should never output NA/None for null, it should always output an empty string.
Yeah, it's true that the files are non-compliant if they include NA/None. And the R and python libraries do output empty string for NA/None values.
But, NA/None tend to be the default outputs from TSV writers outside the airr reference libraries. So, it's a compromise to deal with typical TSV output.
Sounds like there are two things here. 1) change the test so it works for both R and python. That should be easy then. 2) python and R need more support for null-like values, for cross-language interoperability of AIRR TSV. That should probably be it's own issue, as it's new code to write, with the task to write additional tests to handle the null-like values.
Start building the base so we only need to edit the OpenAPI v3 spec, and the v2 spec can be generated from v3. Also clean up the test suites so there is one common set of test data files used by all languages.
Restarted as I messed up the master merge with the old PR #758