Fix boolean columns during merges of extra info at read time

pinin4fjords commented 2 years ago

Currently, if extra obs or var are supplied at read time and cover all the obs or var, boolean string columns from the extras ('True', 'False') are interpreted correctly and become booleans.

If the extra obs or var do NOT cover all the obs or var in the input (e.g. where the matrix contains spikes), then the column instead becomes a pandas 'object' column, with string values 'True', 'False' and 'nan', with the 'nan' values added for the missing ones.

I don't see a way to tweak the merge call to fix this directly, but we can detect when a column consists entirely of 'True', 'False' and 'nan', and fix accordingly- which is what this PR does.

I considered setting the 'nan' to 'False', but decided against it, I think we're better making explicitly a missing value in a boolean column.

The use case which triggered this was 'mito', which we supply from the script which derives gene annotations. The mito column in the above situation cannot cannot currently be used correctly in the filtering step because it's not interpreted as boolean and doesn't then lead to pct_mito being calculated. Since we're introducing mitochondrial filtering into our standard pipelines this is a necessary fix.

Correction, 24/9

Actually, the issue was not with MTX reading, it's due to the way anndata (probably hdf5 itself) was storing the NaNs on disk. So we actually need to institute a fix whenever the anndata object is read. Also, boolean columns are not nullable, so the empty values need to be False after all.

pcm32 commented 2 years ago

The best way to make sure that these things actually work is in my opinion to add some doctests for the specific functions. You put some little test file with the content, you give a read and try the functionality. And then little by little we start to accumulate tests. I would suggest that you do that.

pcm32 commented 2 years ago

The code might look good to me, but is difficult to anticipate real problems like this.

pinin4fjords commented 2 years ago

@pcm32 sure. I'll merge this for now, I want to get it through conda etc so I can use it next week to get pipelines going again, and I've tested it thoroughly now, but I'll look at instituting better testing.

ebi-gene-expression-group / scanpy-scripts

Fix boolean columns during merges of extra info at read time #109