cellengine / cellengine-python-toolkit

CellEngine Python API toolkit
https://cellengine.github.io/cellengine-python-toolkit/
MIT License
4 stars 0 forks source link

`get_events`/`download_fcs_file` should provide TSV or FCS files #83

Open gegnew opened 3 years ago

gegnew commented 3 years ago

A file can be requested and opened as a DataFrame, or saved as an .fcs file, but not as a .tsv, even though the API provides this option: https://docs.cellengine.com/api/#download-file-or-population-fcs-or-tsv

gingo00 commented 2 years ago

Right now, if I am not mistaking, get_eventsalways fetch a .fcs file. If no "destination" file is specified, then the events are parsed by an FCSParser and retrieved into a dataframe.

I think this can be a problem if compensatedQ / compensationId are specified, because the new, possibly updated spill matrix returned in the fetched FcsFile is not used (spill_stringproperty not updated, only the eventsproperty appears to be updated) and not applied. Thus, compensatedQ / compensationId has no real effect.

To me, fetching an event dataframe should be more analogous to fetching a "tsv" format, because the user really just want the data table. That way, if compensation is specified in the query, the returned dataframe would already be "compensated", because for "tsv" format, this is done server-side. This could be particularly helpful if the GET /experiments/:exptId/fcsfiles/:fileid.fcs|tsv cellengine API endpoint get updated to support fetching only selected channels (which you would want to get already compensated).

zbjornson commented 2 years ago

spill_stringproperty not updated, only the eventsproperty appears to be updated) and not applied. Thus, compensatedQ / compensationId has no real effect.

That's a good point. The only reason we support FCS format is for smaller/faster transfers than TSV, but that is a major speedup.

What if get_events applied the compensation from the downloaded FCS file's spill string if compensatedQ=True? Then it would be symmetric with TSV format's behavior, and wouldn't overwrite the spill_string, which is meant to be the one in the original FCS file.

(Requests with subsets of channels would have to use TSV format, as you said.)

gingo00 commented 2 years ago

but that is a major speedup.

Yes I assumed so. Seems like a column-major binary format would possibly have been optimal for this because both the input (gata) and the output (pandas dataframe) use column-major representation.

What if get_events applied the compensation from the downloaded FCS file's spill string if compensatedQ=True

What you are proposing would work, provided I guess that the compensation.apply function is modified to not recursively call get_events again like it currently does .