MetaCell / geppetto-NeuroSCAN

Yale NeuroSCAN & Promoter DB Project
MIT License
Data validator #309

Closed afonsobspinto closed 1 year ago

afonsobspinto commented 1 year ago

closes (1st iteration of) https://metacell.atlassian.net/browse/YALE-6

Misc:

The ingest all script starts by cleaning a specified output directory, then parses and validates data from the two apps (NeuroScan and PromoterDB). Any issues encountered during parsing are logged to a file. If no critical errors are found, the script exports the validated data to an output directory in a structured format (CSV and JSON). Optionally, if the --transform flag is set, the script can zip CPHATE plot data and convert AVI files to MP4 format. If the --dry-run flag is not set, the script proceeds to ingest this exported data into a database.
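The control flow described above could be sketched roughly as follows. This is a minimal illustration, not the actual `ingestall.py`: the flag names `--root-dir`, `--transform` and `--dry-run` come from this thread, while `build_parser`, `plan_steps`, the step labels, and the exact ordering of export and transform are assumptions made for the example.

```python
import argparse


def build_parser():
    """CLI with the three flags mentioned in this PR (everything else is hypothetical)."""
    parser = argparse.ArgumentParser(description="Sketch of the ingest-all flow")
    parser.add_argument("--root-dir", required=True, help="directory holding the raw data")
    parser.add_argument("--transform", action="store_true",
                        help="zip CPHATE plot data and convert AVI files to MP4")
    parser.add_argument("--dry-run", action="store_true",
                        help="stop after exporting; do not touch the database")
    return parser


def plan_steps(args, critical_errors):
    """Return the ordered list of stages that would run for the given flags.

    `critical_errors` is the number of critical issues found while parsing;
    the step names are illustrative labels, not real function names.
    """
    steps = ["clean_output_dir", "parse_and_validate", "write_issue_log"]
    if critical_errors > 0:
        # Nothing is exported or ingested when validation fails critically.
        return steps + ["abort"]
    steps.append("export_csv_and_json")
    if args.transform:
        steps.append("transform_media")
    if not args.dry_run:
        steps.append("ingest_into_database")
    return steps


if __name__ == "__main__":
    demo_args = build_parser().parse_args(
        ["--root-dir", "./data_new", "--transform", "--dry-run"]
    )
    print(plan_steps(demo_args, critical_errors=0))
```

With `--dry-run` set, the plan ends at the export (and optional transform) stage, which matches the local command quoted further down.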

So far I have only tested the general flow of the script with a very small custom-made dataset:

https://github.com/MetaCell/geppetto-NeuroSCAN/assets/19196034/e3c267ea-0f2e-4097-a9ba-1089896b6cef

I couldn't test with the client's dataset because it doesn't follow the defined rules. cc @ddelpiano Here are a couple of the issues reported:

In the process of doing this task I also noticed a couple of potential problems / missing components in our data model.

The script might still need some adjustments to be completely equivalent to how it was previously used. I'm running it locally with:

python ingestall.py --root-dir ./data_new --transform --dry-run

which outputs the CSVs to be ingested to geppetto-NeuroSCAN/ingestion/output

EDIT:

Added instructions on how to run the validator: [image]

zsinnema commented 1 year ago

@afonsobspinto yesterday I ran the current ingestion software for L1/0 and it ingested without issues. So if your validation software fails on L1/0, please cross-check the new validation rules against the old ingestion software.

afonsobspinto commented 1 year ago

Moving the PR to draft, as we need to use the previous ingestion process as the source of truth instead of the Yale model presentation.

afonsobspinto commented 1 year ago

I compared the parsers' output from both the old and the new ingestion, and they match for the most part.

There are two meaningful differences:

Here are the files produced parsers_output.tar.gz
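A comparison of this kind can be reproduced with plain set differences. The sketch below is only an illustration: `diff_names` is a hypothetical helper, and it assumes each parser's output has already been reduced to a list of entity names (neurons or synapses).

```python
def diff_names(old_names, new_names):
    """Compare two collections of entity names.

    Returns the names only present in the new output ("added") and the
    names only present in the old output ("missing"), both sorted.
    """
    old, new = set(old_names), set(new_names)
    return {
        "added": sorted(new - old),
        "missing": sorted(old - new),
    }


if __name__ == "__main__":
    # Toy data; the names are placeholders, not the real parser output.
    old_output = ["AIAR", "ADLR", "synapse_x"]
    new_output = ["AIAR", "ADLR", "BWM-DL01"]
    print(diff_names(old_output, new_output))
```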

mohler commented 1 year ago

Alphonso-

Synapses marked to include non-neuronal cells are not necessarily wrong. Some synapses appear to involve other cell types. The annotations correctly reflect the observation of the annotator. PLEASE, do not trash them. Accommodate them.

Thanks, Bill Mohler

On Fri, Sep 8, 2023, 10:51 AM Afonso Pinto @.***> wrote:

I compared the parsers' output from both the old and the new ingestion, and they match for the most part.

There are two meaningful differences:

  • neurons added in the new ingestion compared to the previous one:

BWM-DL01 BWM-DL02 BWM-DL03 BWM-DL04 BWM-DL05 BWM-DL06 BWM-DL08 BWM-DR01 BWM-DR02 BWM-DR03 BWM-DR04 BWM-DR05 BWM-DR06 BWM-DR07 BWM-DR08 BWM-VL01 BWM-VL02 BWM-VL03 BWM-VL04 BWM-VL05 BWM-VL06 BWM-VL07 BWM-VL08 BWM-VR01 BWM-VR02 BWM-VR03 BWM-VR04 BWM-VR05 BWM-VR06 BWM-VR07 BWM-VR08

  • synapses not added in the new ingestion (due to invalid neuron references, i.e. at least one of the neurons mentioned in the synapse does not exist):

SVV_ADLRundefinedAIAR_CEPshVR-A_post1 SVV_ADLRundefinedAIAR_CEPshVR-A_post2 SVV_ADLRundefinedAIAR_CEPshVR-A_pre
SVV_AIMLundefinedASJL_CEPshVL-A_post1 SVV_AIMLundefinedASJL_CEPshVL-A_post2 SVV_AIMLundefinedASJL_CEPshVL-A_pre
SVV_AINLundefinedCEPshDR-A_post1 SVV_AINLundefinedCEPshDR-A_pre
SVV_AINLundefinedCEPshVL-A_post1 SVV_AINLundefinedCEPshVL-A_pre
SVV_AINLundefinedCEPshVR_BAGL-A_post1 SVV_AINLundefinedCEPshVR_BAGL-A_post2 SVV_AINLundefinedCEPshVR_BAGL-A_pre
SVV_AINRundefinedCEPVL_CEPshVL-A_post1 SVV_AINRundefinedCEPVL_CEPshVL-A_post2 SVV_AINRundefinedCEPVL_CEPshVL-A_pre
SVV_ASIRundefinedAWCR_CEPshVR-A_post1 SVV_ASIRundefinedAWCR_CEPshVR-A_post2 SVV_ASIRundefinedAWCR_CEPshVR-A_pre
SVV_ASJLundefinedPVQL_CEPshDL-A_post1 SVV_ASJLundefinedPVQL_CEPshDL-A_post2 SVV_ASJLundefinedPVQL_CEPshDL-A_pre
SVV_BDULundefinedBDUR_GLRDR-A_post1 SVV_BDULundefinedBDUR_GLRDR-A_post2 SVV_BDULundefinedBDUR_GLRDR-A_pre
SVV_DVCundefinedCEPshVL_RIBL-A_post1 SVV_DVCundefinedCEPshVL_RIBL-A_post2 SVV_DVCundefinedCEPshVL_RIBL-A_pre
SVV_IL2DLundefinedGLRDL-A_post1 SVV_IL2DLundefinedGLRDL-A_pre
SVV_PVQRundefinedAIAR_CEPshVR-A_post1 SVV_PVQRundefinedAIAR_CEPshVR-A_post2 SVV_PVQRundefinedAIAR_CEPshVR-A_pre
SVV_RIARundefinedCEPVL_GLRVL-A_post1 SVV_RIARundefinedCEPVL_GLRVL-A_post2 SVV_RIARundefinedCEPVL_GLRVL-A_pre
SVV_RIBRundefinedAVAR_RIAR_BAGL_CEPshVR-A_post1 SVV_RIBRundefinedAVAR_RIAR_BAGL_CEPshVR-A_post2 SVV_RIBRundefinedAVAR_RIAR_BAGL_CEPshVR-A_post3 SVV_RIBRundefinedAVAR_RIAR_BAGL_CEPshVR-A_post4 SVV_RIBRundefinedAVAR_RIAR_BAGL_CEPshVR-A_pre
SVV_RIFLundefinedAVEL_CEPshVL-A_post1 SVV_RIFLundefinedAVEL_CEPshVL-A_post2 SVV_RIFLundefinedAVEL_CEPshVL-A_pre
SVV_RIGLundefinedCEPshVL_RIR-A_post1 SVV_RIGLundefinedCEPshVL_RIR-A_post2 SVV_RIGLundefinedCEPshVL_RIR-A_pre
SVV_RIHundefinedIL2L_GLRL-A_post1 SVV_RIHundefinedIL2L_GLRL-A_post2 SVV_RIHundefinedIL2L_GLRL-A_pre
SVV_RIHundefinedIL2L_GLRL-B_post1 SVV_RIHundefinedIL2L_GLRL-B_post2 SVV_RIHundefinedIL2L_GLRL-B_pre
SVV_RIHundefinedRIAL_RIBL_CEPshVL-A_post1 SVV_RIHundefinedRIAL_RIBL_CEPshVL-A_post2 SVV_RIHundefinedRIAL_RIBL_CEPshVL-A_post3 SVV_RIHundefinedRIAL_RIBL_CEPshVL-A_pre
SVV_RIHundefinedRIBL_CEPshVL-A_post1 SVV_RIHundefinedRIBL_CEPshVL-A_post2 SVV_RIHundefinedRIBL_CEPshVL-A_pre
SVV_RIVRundefinedGLRVL_excgl-A_post1 SVV_RIVRundefinedGLRVL_excgl-A_post2 SVV_RIVRundefinedGLRVL_excgl-A_pre
SVV_RIVRundefinedGLRVL_GLRVR-A_post1 SVV_RIVRundefinedGLRVL_GLRVR-A_post2 SVV_RIVRundefinedGLRVL_GLRVR-A_pre
SVV_RIVRundefinedGLRVL_SAADL-A_post1 SVV_RIVRundefinedGLRVL_SAADL-A_post2 SVV_RIVRundefinedGLRVL_SAADL-A_pre
SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-A_post1 SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-A_post2 SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-A_post3 SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-A_pre
SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-B_post1 SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-B_post2 SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-B_post3 SVV_SMBVLundefinedBWM-VL04_BWM-VL02_GLRL-B_pre

Both differences seem correct to me in the newer version, but please let me know.

Here are the files produced parsers_output.tar.gz https://github.com/MetaCell/geppetto-NeuroSCAN/files/12560999/parsers_output.tar.gz

mohler commented 1 year ago

Pardon me for misspelling your name Afonso.

To clarify a few things:

First, eliminating these synapses from the model removes the opportunity for groundbreaking insights by NeuroSCAN users into the interface between nerve and non-nerve cell types. This is truly the cutting edge of neuroscience in 2023. Please do not tidy it out of existence to fit the data structure.

Second, I am the originator of the code and the producer of all the models that make up the NeuroSCAN content base. I developed the nomenclature, the color scheme, the spatial registration, etc. for all model components.

I introduced myself some time ago on this GitHub site and received no reply from anyone at MetaCell. I consider this unfortunate, since direct consultation could have reduced cost and delay substantially.

At Noelle's request, I am currently generating updated gltfs for all data sets. I would appreciate openness from MetaCell to my asking questions to streamline the fit of new models to your now established data structure.

Please give me some sort of acknowledgement that you are receiving this communication from me now.

Bill Mohler

afonsobspinto commented 1 year ago

Hi Bill, no worries about the name mix-up.

Your input was duly noted. We'll discuss this internally to ensure we're aligned and update this thread after.

Thanks for reaching out

mohler commented 1 year ago

Great! Thanks for the reply.

afonsobspinto commented 1 year ago

For now, the validator was adapted to mark the unrecognized references to a neuron in a synapse name as a warning instead of an error. This makes the ingestion step still process that entry, only creating relations (with neurons) for the ones it can find:

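[image: https://user-images.githubusercontent.com/19196034/267703916-7921a220-ad12-48e3-9c2a-0d1b45e0ffaa.png] (SVV_ADLRundefinedAIAR_CEPshVR-A_post1, a synapse previously mentioned as left out but now ingested)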
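The behaviour described in this comment could look roughly like the sketch below. It is an assumption-laden illustration rather than the validator's real code: `Report`, `link_synapse`, and the idea that the neuron names referenced by a synapse are already available as a list are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Collects validation messages; only errors would block ingestion."""
    warnings: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def link_synapse(synapse_name, referenced_neurons, known_neurons, report):
    """Return the referenced neurons that exist, warning about the rest.

    Unknown references no longer reject the synapse: they are recorded as
    warnings, and relations are created only for the neurons that are found.
    """
    linked = []
    for neuron in referenced_neurons:
        if neuron in known_neurons:
            linked.append(neuron)
        else:
            report.warnings.append(
                f"{synapse_name}: unrecognized neuron reference '{neuron}'"
            )
    return linked


if __name__ == "__main__":
    report = Report()
    # How the name is split into neuron references is assumed here.
    linked = link_synapse(
        "SVV_ADLRundefinedAIAR_CEPshVR-A_post1",
        ["ADLR", "AIAR", "CEPshVR"],
        known_neurons={"ADLR", "AIAR"},
        report=report,
    )
    print(linked, report.warnings)
```

Keeping the entry and recording a warning preserves synapses that involve non-neuronal cells, at the cost of those partners not being linked until the data model can represent them.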

mohler commented 1 year ago

That's great. Thank you.

ddelpiano commented 1 year ago

Hi @mohler, as discussed with Noelle at our biweekly meetings, we are going to accommodate the change request and ingest these synapses. In the long term, the search mechanism will be modified to make these data searchable, since the data model and component, as originally designed following the given requirements, don't allow the user to search for this data. I will keep you posted on future updates related to these data.

Thanks, Dario

mohler commented 1 year ago

Thank you!

mohler commented 1 year ago

Thank you!!
