SS-1: First pass at the ADRECS brick, Need help with DVC pipeline

Sulstice commented 3 months ago

Hi,

In this PR, I've added the ADRECS module to the bricks layout. The module is simple: it has two functions, download_data and process_data, exposed as a click command pipeline.

The data is ingested as gzip and xlsx files, read in as bytes. The output is written to parquet files.

@zmughal

I added the dvc.yaml file, and you can run the code with dvc repro. However, I can't figure out the correct file paths for the yaml file and was wondering if you could help me. Why is there raw and list in some, and what do they mean?


    deps:
      - ./stages/01_adrecs.py
    outs:
      - brick

I can probably figure it out soon, but it seemed worth putting up the PR now.

zmughal commented 3 months ago

Just to show the wget approach:

    wget -r \
         --accept-regex '.*/ADReCS/download/.*' \
         -P download \
         --no-host-directories --cut-dirs=2 \
         'http://bioinf.xmu.edu.cn/ADReCS/download.jsp'

zmughal commented 3 months ago

I made a couple of changes.

The issue was with processing ADReCS_ADR_Severity_Grade_v3.3.txt.gz, not Drug_ADR_v3.3.txt.gz. That file is full of "-1" alongside strings such as "Mild" and "Moderate", so the reader was guessing the column types to be numeric. Adding low_memory=False lets read_csv() see the entire file before inferring types, so it reads those columns as strings.

I replaced both '-1' (str) and -1.0 (float) with NA to indicate null (as written on the website). For the severity grade, I turned those columns into categorical data.
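
For anyone following along, here is a rough sketch of the steps described above, not the exact code in the PR. The sample data, the column names (ADR_ID, Grade1, Grade2), and the read_severity helper are all made up for illustration, and the sketch uses mask()/isin() to null out the -1 values, which may differ from how the brick does the replacement.

```python
import gzip
import io

import pandas as pd

# Hypothetical stand-in for the severity-grade file; the real column names may differ.
SAMPLE = (
    "ADR_ID\tGrade1\tGrade2\n"
    "A1\tMild\t-1\n"
    "A2\t-1\tModerate\n"
    "A3\tSevere\t-1\n"
)


def read_severity(raw: bytes) -> pd.DataFrame:
    """Read a gzipped TSV from bytes, null out -1 values, and make grade columns categorical."""
    # low_memory=False has pandas infer each column's type from the whole file
    # instead of chunk by chunk.
    df = pd.read_csv(io.BytesIO(gzip.decompress(raw)), sep="\t", low_memory=False)

    # Treat both the string '-1' and the float -1.0 as missing values.
    df = df.mask(df.isin(["-1", -1.0]))

    # Assumption: severity-grade columns can be picked out by a "Grade" prefix.
    for col in df.columns:
        if col.startswith("Grade"):
            df[col] = df[col].astype("category")
    return df


if __name__ == "__main__":
    frame = read_severity(gzip.compress(SAMPLE.encode("utf-8")))
    print(frame.dtypes)
    print(frame)
```

Making the grade columns categorical keeps the small set of severity labels compact and makes the allowed values explicit once the nulls are out of the way.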

biobricks-ai / adrecs

SS-1: First pass at the ADRECS brick, Need help with DVC pipeline #5