Check for missing csv entries

Emory-HITI / Niffler

Niffler: A DICOM Framework for Machine Learning and Processing Pipelines.

https://emory-hiti.github.io/Niffler/

BSD 3-Clause "New" or "Revised" License

90 stars 53 forks source link

Check for missing csv entries #15

Closed pradeeban closed 3 years ago

pradeeban commented 4 years ago

For example, missing accession in an EMPI-Accession extraction.

Nishchal-007 commented 3 years ago

I have made a python script for this Here is the Github link to the repo: https://github.com/Nishchal-007/CheckingNull Also, there is a dummy CSV file for testing purpose

Let me know if there's anything else to be done.

pradeeban commented 3 years ago

A few pointers:

When we say missing entries, what we mean is a file:

empi, accassion 32323, 32323323232 23323, 323232, 332323 , 3232332323

In this case, the line 2 and 4 should be skipped, but the rest of the extraction should go on.

It is not the entire CSV that is empty or should be considered invalid and ignored. Rather, making the executions succeed, despite the missing lines.
Also, I recommend cloning the repository and submitting pull requests, rather than creating private repositories with stand-alone codes. The code changes will modify the https://github.com/Emory-HITI/Niffler/blob/dev/modules/cold-extraction/ColdDataRetriever.py (Please use the dev branch).

Nishchal-007 commented 3 years ago

Okay, Will do that !!

Nishchal-007 commented 3 years ago

I have forked and made changes to the cold-extraction module and added the checkCSV.py file. Can you please check it out?

pradeeban commented 3 years ago

That pull request modifies 67 files. Please resubmit a new pull request without modifying other files, also addressing other comments in the pull request.

Nishchal-007 commented 3 years ago

Yeah I checked the points in the pull request My approach to this current issue is if there are any missing values in the file I'm just dropping them in the beginning itself Is this approach okay or should I change this?

And regarding the file changes, I did "add all" that's why it affected all the files But no file is modified or changed internally

pradeeban commented 3 years ago

Ideally, you should drop it on the go, as otherwise, this process of reading the file will introduce an initial delay.

Also, there is no reason to create a separate python class. You can modify modules/cold-extraction/ColdDataRetriever.py.

You can use the "csv_file" variable there as it already points to the csv_file.

Nishchal-007 commented 3 years ago

Okay

Nishchal-007 commented 3 years ago

And also I'm applying for GSOC 2021

Nishchal-007 commented 3 years ago

Made the necessary changes in the ColdDataRetriever.py file

pradeeban commented 3 years ago

Please pay attention to the comments.

https://github.com/Emory-HITI/Niffler/pull/115/files changes files unnecessarily. Please avoid changing files that have no relevance to the pull request. Also, I believe you are modifying only the ColdDataRetriever.py and other changes (such as a sample CSV file) are not necessary.

Nishchal-007 commented 3 years ago

Only changed the ColdDataRetriever.py file

pradeeban commented 3 years ago

Added some important comments in your pull request. Please attend to them and submit an updated request when you have tested the changes.

A comment that is more specific to elaborate this issue:

Please note that you should check only for the missing entry of the mandatory fields.

empi_date: make sure neither empi nor date is empty (there could be other fields that can be empty).
empi: make sure empi is not empty (there could be other fields that can be empty).
accession: make sure accession is not empty (there could be other fields that can be empty).
empi_accession: make sure neither empi nor accession is empty (there could be other fields that can be empty).

Nishchal-007 commented 3 years ago

Hey, I made the required changes to the ColdDataRetriever.py file

Can you please check it out?

pradeeban commented 3 years ago

Looks good now. Merged to dev. Thanks.