AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License

Update cell type checkpoints to include the updated reference file names #178

Closed: allyhawkins closed this 8 months ago

allyhawkins commented 8 months ago

Related to https://github.com/AlexsLemonade/ScPCA-admin/issues/691

We recently updated the reference file names for both SingleR and CellAssign. Before that change, we had already run a few samples through both methods. To decide whether cell typing can be skipped, we check that the reference file names stored in library_id_cellassign/scpca-meta.json and library_id_singler/scpca-meta.json match the reference file names passed to the workflow via the project metadata. If we want to run the projects through again and skip CellAssign for samples that already have CellAssign results, then the reference file names in those scpca-meta.json files need to be updated.
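To make the skip condition concrete, here is a minimal Python sketch of the comparison described above. It is an illustration only, not the workflow's actual code: the `reference_file` key and the function names are hypothetical, since the real layout of scpca-meta.json is not shown here.

```python
import json
from pathlib import Path


def reference_matches(meta: dict, expected_reference: str, key: str = "reference_file") -> bool:
    """Return True if the checkpoint metadata records the expected reference file name.

    `key` is a hypothetical field name used for illustration.
    """
    return meta.get(key) == expected_reference


def can_skip_celltyping(checkpoint_path: str, expected_reference: str) -> bool:
    """Cell typing can be skipped only if the checkpoint exists and its reference matches."""
    path = Path(checkpoint_path)
    if not path.exists():
        return False
    with path.open() as f:
        meta = json.load(f)
    return reference_matches(meta, expected_reference)
```

Under this logic, a checkpoint that still records an old reference file name never matches, so the method would be rerun even though results already exist.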

Here I'm adding a script that updates the cell type scpca-meta.json files so that the reference file names match what's in scpca-project-celltype-metadata.tsv. It's mostly modeled after the script we use for updating the mapping-related scpca-meta.json files, but here we need to update two checkpoint files per library, one for each method.

We also want to account for values that may already be there but with a different file name, so we directly compare what's in the checkpoint file against what's in the metadata file and update accordingly. Additionally, if NA is in the project metadata, then we don't set the path and fill with NA instead. This shouldn't really affect any of the files that will get updated here, though: I removed all old SingleR files, so in practice only the CellAssign results are getting updated, and those only exist if there was a reference file in the first place.
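The compare-and-update behavior, including the NA case, could look roughly like the sketch below. The key name, the reference directory, and the function names are assumptions made for illustration and are not taken from the script itself.

```python
import posixpath


def resolve_reference(project_value, ref_dir: str) -> str:
    """Build the value the checkpoint should hold for one method.

    If the project metadata has NA (or nothing), no path is set and "NA" is returned.
    """
    if project_value in (None, "", "NA"):
        return "NA"
    return posixpath.join(ref_dir, project_value)


def update_checkpoint(meta: dict, key: str, project_value, ref_dir: str) -> bool:
    """Compare the checkpoint value with the project metadata and update it in place.

    Returns True if the checkpoint was changed, False if it already matched.
    `key` is a hypothetical field name.
    """
    expected = resolve_reference(project_value, ref_dir)
    if meta.get(key) == expected:
        return False
    meta[key] = expected
    return True
```

Each library would need this applied twice, once to the SingleR checkpoint and once to the CellAssign checkpoint.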

I've tested this with runs in scpca/processed. Once this gets approved, I'll run it for scpca-prod.

allyhawkins commented 8 months ago

The structure here seems good, but I think we want to be a bit more strict about which file names we change, so that we don't accidentally run this in the future and update things we don't want to update. Specifically, we should check that the file name is of the "old" format before updating, so we don't accidentally update version numbers in the future when we don't really want to.

This is a great point! I went ahead and added a check for a version string. If that's present in either of the filenames that are in scpca-meta.json, then no updates are made. I also updated the description of the script at the top of the file to mention that.
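A guard along these lines might be sketched as follows. The exact version string format is not given in this thread, so the `_v1-0-0`-style pattern and the example file names below are assumptions chosen only to show the idea.

```python
import re

# Assumed version format such as "_v1-0-0"; the real pattern may differ.
VERSION_PATTERN = re.compile(r"_v\d+-\d+-\d+")


def has_version_string(filename) -> bool:
    """Return True if the file name already carries a version string."""
    return isinstance(filename, str) and bool(VERSION_PATTERN.search(filename))


def should_update(singler_ref, cellassign_ref) -> bool:
    """Update only when neither checkpoint file name already has a version string."""
    return not (has_version_string(singler_ref) or has_version_string(cellassign_ref))
```

With this check, rerunning the script after the file names have been migrated leaves the checkpoints untouched, so a later version bump is not applied by accident.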