Vascular disease - Lifelines categorical variable in clinical status

hcadavid commented 1 year ago

Pairing rule issue

Description

According to the comments on the spreadsheets, atherosclerosis_presence_adu_q_1 is a multiple choice question ('could you indicate which of the following disorders you have (had)?'), where 'atherosclerosis' is one answer, but so is, e.g., 'heart valve problems'. @baukearends and @squareb, when should we consider it for the clinical status (e.g., only if a certain answer is given, or if any value is given)?

@squareb: in this particular variable, how would a 'NONE' be represented? The documentation refers to the tokens used for missing values, but I assume these are not used in this particular case.

Specification https://github.com/MyDigiTwinNL/Lifelines2Medmij-Mapping-tool/blob/6c17f756e86c952d80fb382c2fb043ce5e1fc745/src/lifelines/VascularDisease.ts#L36-L63

baukearends commented 1 year ago

According to the comments @squareb provided, there are four options:

- Myocardial infarction
- Stroke (both ischemic and hemorrhagic)
- Heart failure / cardiomyopathy
- Intermittent claudication

The diseases in bold should be included, the other two should be excluded as they are a separate entity. In the spreadsheets I don't see the option for 'heart valve problems', but if present, it should not be included in Vascular disease.

hcadavid commented 1 year ago

@baukearends, I think these four options are actually for cvd_followup_adu_q_1, according to the discussion from the spreadsheets (see below). However, from that discussion it is still not clear what the specific categories of atherosclerosis_presence_adu_q_1 are, and the catalog shows only one: atherosclerosis (arteriosclerosis). @squareb, could you give us some more information about it, and about how we could use it as an additional indicator of 'active' VD?

atherosclerosis_presence_adu_q_1 seems to be a multiple choice question, is there a way to see the possible answers?

Bas: The main question is: 'could you indicate which of the following disorders you have (had)?', of which 'atherosclerosis' is one answer, but so is, e.g., 'heart valve problems'.

cvd_followup_adu_q_1: is it possible to see the list of 'following conditions'?

Bas: Yes, these are: 
- heart infarction
- stroke (brain infarction, brain haemorrhage)
- heart failure/heart muscle disorder
- clogged arteries in the legs (intermittent claudication)

squareb commented 1 year ago

@hcadavid For the variable regarding atherosclerosis_presence_adu_q_1, the parent question is "could you indicate which of the following disorders you have (had)?", of which the participant could indicate four possible answers, namely:

atherosclerosis
heart valve problems
thrombosis
pulmonary embolism

I have to say that this may have been somewhat confusing to find on the wiki, since the first two are CVD-related and can be found at http://wiki.lifelines.nl/doku.php?id=cardiovascular_diseases, while thrombosis and pulmonary embolism can be found at http://wiki.lifelines.nl/doku.php?id=blood_disorders.

The NONE outcome may be defined when a participant did fill in the questionnaire (which is the case when they have a record for a specific assessment) but did not indicate having any of the disorders at any given time point. The possibility that the participant (accidentally) skipped the question always remains, but that will most likely apply to a minority.
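The rule above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Answer enum, the atherosclerosis_answer function, and its has_record flag are hypothetical names, and treating the string '1' as "option ticked" is an assumption about the encoding, not something taken from the mapping tool.

```python
from enum import Enum
from typing import Optional


class Answer(Enum):
    """Hypothetical outcome categories for one assessment."""
    PRESENT = "present"   # participant ticked 'atherosclerosis'
    NONE = "none"         # questionnaire filled in, option not ticked
    UNKNOWN = "unknown"   # no record for this assessment at all


def atherosclerosis_answer(has_record: bool, value: Optional[str]) -> Answer:
    """Classify atherosclerosis_presence_adu_q_1 for a single assessment.

    has_record: whether the participant has a record for the assessment.
    value: raw cell content (assumed to be '1' when the option was ticked).
    """
    if not has_record:
        return Answer.UNKNOWN
    if value is not None and value.strip() == "1":
        return Answer.PRESENT
    # A record exists but the option was not ticked. A skipped question
    # cannot be told apart from a genuine 'none' in this sketch.
    return Answer.NONE


print(atherosclerosis_answer(True, "1"))
print(atherosclerosis_answer(True, ""))
print(atherosclerosis_answer(False, None))
```

Note that any missing-value tokens used in the real data would also fall into the NONE branch here, so they would need separate handling if they turn out to occur in this column.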

baukearends commented 1 year ago

Apologies, I indeed mentioned the wrong subanswers. For our variable Vascular disease, only the subanswer atherosclerosis is relevant.

hcadavid commented 1 year ago

I came back to this one now that we have access to the data. From what I understand from this discussion thread, atherosclerosis_presence_adu_q_1 is one of four options (including heart valve problems, thrombosis, pulmonary embolism) of a common parent question. Given this, I was expecting yes/no values. However, when looking at the content, I found values between 0 and 63 (@baukearends I also dumped the column in our tmp01 folder - hcadavid/analysis/atherosclerosis_presence_adu_q_1_col.txt). I haven't found information about how the data in this column is encoded. @squareb could you give us some hints on this?

squareb commented 11 months ago

This is indeed weird. Could it be that you extracted the wrong column? For example, the column 'asthma_startage_adu_q_1' is next to 'atherosclerosis_presence_adu_q_1', and that column does have values between, e.g., 0 and 63. The column atherosclerosis_presence_adu_q_1 should only have '1' values as answers.

hcadavid commented 11 months ago

@squareb I'm double-checking but still getting the same results.

I'm extracting column #85 from the 1a_q_1_results.csv file, which seems to be the right one, as it returns "atherosclerosis_presence_adu_q_1" as the first value:

cut -d',' -f85 ./1a_q_1_results.csv | more

Output:

"atherosclerosis_presence_adu_q_1"
"$6"
...

When sorting the results,

cut -d',' -f85 ./1a_q_1_results.csv | sort

I'm still getting this small set of 'odd' values at the end:

....
"7"
"7"
"8"
"8"
"8"
"8"
"9"
"9"
"9"

Please let me know if I may be missing something.

squareb commented 10 months ago

@hcadavid Are you performing the cut command on an unmodified 1a_q_1_results.csv file? Or did you alter the file, e.g., to only include certain participants? When I do cut -f85, I'm getting data from the column asthma_presence_adu_q_2.

hcadavid commented 10 months ago

@squareb As far as I understand, I'm working on an unmodified copy of the file (its timestamp is the same as that of the other files - Jul 18 10:21 1a_q_1_results.csv).

Here is the hash code of the file so you can check if we are using the same one:

sha1sum 1a_q_1_results.csv

Output:

1187c160c51a2a13137ac0c48e70b3b0b5bcea25 1a_q_1_results.csv

squareb commented 10 months ago

@hcadavid I may have found a clue as to why this is going wrong (it's been a bit of a puzzle). The data file contains a variable that holds open text, and in these open text values some participants gave an answer that included a comma. If you then provide the comma as the field separator for the cut command, it also splits on the commas inside the open text field, which is why the columns are being shifted. For example, the '63' value you're getting comes from the neighbouring column 'asthma_startage_adu_q_1', because the participant gave an open text answer including a comma in the variable 'allergen_other_adu_q_1_a'.

Hopefully this helps and clarifies the issue.
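
A minimal sketch of the effect described above, and of reading the column by name with a CSV-aware parser instead of cut. The three-column sample is made up for illustration (only the column names come from this thread), and Python's standard csv module is a swapped-in alternative here, not something used elsewhere in the thread.

```python
import csv
import io

# Hypothetical sample: the first column is open text containing a comma.
SAMPLE = (
    '"allergen_other_adu_q_1_a","asthma_startage_adu_q_1",'
    '"atherosclerosis_presence_adu_q_1"\n'
    '"cats, dogs","63","1"\n'
    '"","",""\n'
)


def column_naive(text, index):
    """Split on every comma, which is what cut -d',' effectively does."""
    return [line.split(",")[index] for line in text.splitlines()]


def column_by_name(text, name):
    """Parse as CSV so commas inside quoted fields stay within the field."""
    reader = csv.DictReader(io.StringIO(text))
    return [row[name] for row in reader]


# Third field per line, ignoring quoting: the second row is shifted by one.
naive = column_naive(SAMPLE, 2)
# Same column selected by header name with proper quote handling.
parsed = column_by_name(SAMPLE, "atherosclerosis_presence_adu_q_1")

print(naive)   # second element is '"63"', taken from the neighbouring column
print(parsed)  # ['1', '']
```

Selecting the column by its header name rather than by position also avoids depending on the column being #85 in every export.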

MyDigiTwinNL / CDF2Medmij-Mapping-tool

Vascular disease - Lifelines categorical variable in clinical status #3