Marshmallow Error in cell_type_mapper.cli.from_specified_markers

mtvector commented 8 months ago

Hi there, thanks for putting this nice package together. I've tried to create a mapping using the commands below. Everything works fine up until the last step (or so it seems!).

H5AD=/allen/programs/celltypes/workgroups/hct/cellTaxonomy/adult-human-brain_v1/additional_files/01_2024/Neurons.h5ad
REF_PATH=/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/data/fenna_cj_mapping/siletti_reference

python -m cell_type_mapper.cli.precompute_stats_scrattch \
--h5ad_path $H5AD \
--hierarchy '["supercluster_term", "cluster_id", "subcluster_id"]' \
--output_path $REF_PATH/precompute_stats.h5 \
--normalization raw \
--tmp_dir ${REF_PATH}/temp/

python -m cell_type_mapper.cli.reference_markers \
--precomputed_path_list '["'${REF_PATH}'/precompute_stats.h5"]' \
--output_dir ${REF_PATH}/ \
--tmp_dir ${REF_PATH}/temp/

python -m cell_type_mapper.cli.query_markers \
--reference_marker_path_list '["'${REF_PATH}'/reference_markers.h5"]' \
--output_path ${REF_PATH}/

QUERY_H5AD=/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/data/fenna_cj_mapping/231114_HMBA_cjNutmeg_Slab6_Tile2_pooled1_human_ortho.h5ad
OUT_PATH="${QUERY_H5AD%.h5ad}_MAPPING"
mkdir -p $OUT_PATH

python -m cell_type_mapper.cli.query_markers \
--reference_marker_path_list '["'${REF_PATH}'/reference_markers.h5"]' \
--output_path ${OUT_PATH}/query_markers.json

python -m cell_type_mapper.cli.from_specified_markers \
--query_path $QUERY_H5AD \
--input_json ${OUT_PATH}/query_markers.json \
--type_assignment.normalization raw \
--precomputed_stats.path $REF_PATH/precompute_stats.h5 \
--output_json ${OUT_PATH}/hann_results.json \
--extended_result_path ${OUT_PATH}/ \
> ${OUT_PATH}/log_outputs.txt 2>&1

But this gives a strange error in marshmallow which I don't understand how to fix. It's quite possible I've done something incorrectly upstream of the from_specified_markers call, but I can't figure out what might be wrong with the markers.

Traceback (most recent call last):
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/miniconda3/envs/mapcells/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/miniconda3/envs/mapcells/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/cell_type_mapper/src/cell_type_mapper/cli/from_specified_markers.py", line 389, in <module>
    main()
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/cell_type_mapper/src/cell_type_mapper/cli/from_specified_markers.py", line 384, in main
    runner = FromSpecifiedMarkersRunner()
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/miniconda3/envs/mapcells/lib/python3.10/site-packages/argschema/argschema_parser.py", line 175, in __init__
    result = self.load_schema_with_defaults(self.schema, args)
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/miniconda3/envs/mapcells/lib/python3.10/site-packages/argschema/argschema_parser.py", line 276, in load_schema_with_defaults
    result = utils.load(schema, args)
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/miniconda3/envs/mapcells/lib/python3.10/site-packages/argschema/utils.py", line 418, in load
    results = schema.load(d)
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/miniconda3/envs/mapcells/lib/python3.10/site-packages/marshmallow/schema.py", line 719, in load
    return self._do_load(
  File "/allen/programs/celltypes/workgroups/rnaseqanalysis/EvoGen/Team/Matthew/utils/miniconda3/envs/mapcells/lib/python3.10/site-packages/marshmallow/schema.py", line 901, in _do_load
    raise exc

marshmallow.exceptions.ValidationError: {'query_markers': ['Missing data for required field.'], 'cluster_id/332': ['Unknown field.'], 'cluster_id/244': ['Unknown field.'], 'cluster_id/85': ['Unknown field.'], 'cluster_id/135': ['Unknown field.'], 'cluster_id/456': ['Unknown field.'], 'cluster_id/292': ['Unknown field.'], 'cluster_id/346': ['Unknown field.'], 'cluster_id/233': ['Unknown field.'], 'cluster_id/227': ['Unknown field.'], 'cluster_id/143': ['Unknown field.'], 'cluster_id/431': ['Unknown field.'], 'cluster_id/409': ['Unknown field.'], 'cluster_id/428': ['Unknown field.'], 'cluster_id/293': ['Unknown field.'], 'cluster_id/328': ['Unknown field.'], 'supercluster_term/Amygdala excitatory': ['Unknown field.'], 'cluster_id/308': ['Unknown field.'], ...

Thanks so much for any guidance you can provide!

danielsf commented 8 months ago

You are passing your ${OUT_PATH}/query_markers.json in as the argument --input_json. It should be passed in as --query_markers.serialized_lookup, i.e. the call you want is

python -m cell_type_mapper.cli.from_specified_markers \
--query_path $QUERY_H5AD \
--query_markers.serialized_lookup ${OUT_PATH}/query_markers.json \
--type_assignment.normalization raw \
--precomputed_stats.path $REF_PATH/precompute_stats.h5 \
--extended_result_path ${OUT_PATH}/hann_results.json \
> ${OUT_PATH}/log_outputs.txt 2>&1

(Note: you also do not want to use --output_json; --extended_result_path should point to the JSON file containing your cell type mapping result).

--input_json is a parameter that the argschema library adds to every executable built on it. It points to a JSON file containing all of the parameters for the executable, i.e. if you wanted to run

python -m cell_type_mapper.cli.from_specified_markers --input_json config.json

then config.json would look like

{
    "query_path": "/path/to/query.h5ad",
    "query_markers": {
        "serialized_lookup": "/path/to/query_markers.json"
    },
    "type_assignment": {
        "normalization": "raw"
    },
    "precomputed_stats": {
        "path": "/path/to/precompute_stats.h5"
    },
    "extended_result_path": "/path/to/output/file.json"
}
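
If you would rather generate that file than write it by hand, here is a minimal sketch using only the Python standard library. The paths are the same placeholders as in the example above, and the file name config.json is just the one used in this thread; nothing here calls cell_type_mapper itself.

```python
import json
from pathlib import Path

# Placeholder paths copied from the example config; replace with real locations.
config = {
    "query_path": "/path/to/query.h5ad",
    "query_markers": {"serialized_lookup": "/path/to/query_markers.json"},
    "type_assignment": {"normalization": "raw"},
    "precomputed_stats": {"path": "/path/to/precompute_stats.h5"},
    "extended_result_path": "/path/to/output/file.json",
}

# json.dumps emits double-quoted keys and the commas that valid JSON requires.
config_path = Path("config.json")
config_path.write_text(json.dumps(config, indent=2) + "\n")

# Reading the file back confirms that it parses as JSON.
reloaded = json.loads(config_path.read_text())
print(sorted(reloaded))
```

Writing the file through json.dumps avoids the quoting and comma mistakes that are easy to make when typing nested JSON by hand.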

Similarly, --output_json is an argschema argument: it is where the library writes some basic metadata from the run, and it is not actually used by this module.
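
To see why the original call produced that particular ValidationError, the sketch below imitates the two checks whose messages appear in the traceback: a required field that is absent, and keys the schema does not recognize. It is a standard-library imitation, not marshmallow's or argschema's actual code. The names validate, REQUIRED_FIELDS and KNOWN_FIELDS are hypothetical, the field sets contain only names that appear in this thread, and the marker file's values are elided because their real structure is not shown here.

```python
# Illustrative assumption: only 'query_markers' is confirmed as required by the
# error message; the real schema may require more fields.
REQUIRED_FIELDS = {"query_markers"}
KNOWN_FIELDS = REQUIRED_FIELDS | {
    "query_path",
    "type_assignment",
    "precomputed_stats",
    "extended_result_path",
}


def validate(params):
    """Return a dict of field name -> list of error messages."""
    errors = {}
    for name in sorted(REQUIRED_FIELDS - params.keys()):
        errors[name] = ["Missing data for required field."]
    for name in params:
        if name not in KNOWN_FIELDS:
            errors[name] = ["Unknown field."]
    return errors


# Top-level keys of the marker file, taken from the error message above.
# The values are placeholders; the real file's contents are not reproduced.
marker_file_contents = {
    "cluster_id/332": [],
    "supercluster_term/Amygdala excitatory": [],
}

# Passing the marker file as --input_json makes its keys look like parameters.
params = {**marker_file_contents, "query_path": "/path/to/query.h5ad"}
errors = validate(params)
for name, messages in errors.items():
    print(name, messages)
```

Every key of the marker file is reported as an unknown field, and query_markers is reported as missing because nothing supplied it, which matches the shape of the error in the original post.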

mtvector commented 8 months ago

Thanks so much Scott, I'm new to this pipeline but that is what I needed to run successfully!

AllenInstitute / cell_type_mapper, issue #14