Sulstice closed 3 months ago
Just to show the wget approach:
wget -r \
--accept-regex '.*/ADReCS/download/.*' \
-P download \
--no-host-directories --cut-dirs=2 \
'http://bioinf.xmu.edu.cn/ADReCS/download.jsp'
I made a couple of changes.
The issue was with processing ADReCS_ADR_Severity_Grade_v3.3.txt.gz, not Drug_ADR_v3.3.txt.gz: that file is full of "-1" alongside strings such as "Mild" and "Moderate", so the reader was guessing numeric types for those columns. Adding low_memory=False lets read_csv() see the entire file at once, so it assigns them to strings.
I replaced both '-1' (str) and -1.0 (float) with NA to indicate null (as written on the website) and, for the severity grade, turned those columns into categorical data.
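For anyone following along, here is a minimal sketch of the approach described above. The column names and sample rows are made up for illustration and are not the real ADReCS headers. It also uses Series.mask() to blank out the sentinel values, which is one way to do the replacement and not necessarily what the PR code does.

```python
import gzip
import io

import pandas as pd

# Hypothetical stand-in for the severity-grade file: tab-separated,
# gzip-compressed, with "-1" used as the null marker.
sample = "ADR_ID\tGrade\tScore\nA1\tMild\t-1\nA2\tModerate\t2.5\nA3\t-1\t-1.0\n"
raw = gzip.compress(sample.encode("utf-8"))

# low_memory=False makes read_csv() infer each column's type from the
# whole file in one pass instead of chunk by chunk.
df = pd.read_csv(io.BytesIO(raw), sep="\t", compression="gzip", low_memory=False)

# Blank out the sentinel in both of its forms: the string "-1" in the
# text column and the float -1.0 in the numeric column.
df["Grade"] = df["Grade"].mask(df["Grade"] == "-1")
df["Score"] = df["Score"].mask(df["Score"] == -1.0)

# Severity grades form a small fixed vocabulary, so store them as categorical.
df["Grade"] = df["Grade"].astype("category")
```

After this, the null markers show up as missing values in df.isna(), and the Grade column only carries the real grade labels as categories.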
Hi,
In this PR, I've added the module ADRECS into the bricks layout. The module is simple, with two functions: download_data and process_data. It is turned into a click command pipeline. The data is ingested as gzip and xlsx files in byte format, and I output it to parquet files.
@zmughal I added the dvc.yaml file and you can run the code with dvc repro. However, I can't figure out the correct file pathing in the yaml file and am wondering if you can help me. Why is there raw and list in some, and what does that mean? I can probably figure it out soon, but it's worth putting up the PR now.
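To make the two-step layout from the PR description concrete, here is a rough sketch of a click command group with one command per step. Only the function names download_data and process_data come from the PR. The group name cli, the command names, the arguments, and the function bodies are all assumptions. The process step here just decompresses to text, where the actual module writes parquet.

```python
import gzip
import urllib.request
from pathlib import Path

import click


@click.group()
def cli():
    """Hypothetical command group holding the two pipeline steps."""


@cli.command("download-data")
@click.argument("url")
@click.argument("dest", type=click.Path(path_type=Path))
def download_data(url, dest):
    """Fetch the raw bytes at URL and store them unchanged at DEST."""
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())
    click.echo(f"downloaded {dest}")


@cli.command("process-data")
@click.argument("src", type=click.Path(exists=True, path_type=Path))
@click.argument("dest", type=click.Path(path_type=Path))
def process_data(src, dest):
    """Decompress the gzip bytes in SRC and write the text to DEST."""
    # Simplification: the real step would parse the table and write parquet.
    text = gzip.decompress(src.read_bytes()).decode("utf-8")
    dest.write_text(text, encoding="utf-8")
    click.echo(f"processed {len(text.splitlines())} lines")
```

Calling cli() from a script entry point would expose the two steps as separate subcommands, which is the shape a dvc.yaml stage can invoke one at a time.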