Problem with handling Date columns and multiple columns for Cox's Regression

cheemcheem commented 6 years ago

So Cox's Regression requires an event column and a duration column. This is fine for most files but I have noticed one of them actually has two date columns instead of a duration column.

This means that the difference in dates in each row needs to be calculated before it can be fitted into a Cox's Regression object.

Another point is that Cox's Regression needs two output columns but the API only has one output column as far as I'm aware.

So we need either HCI or BE to handle merging the columns in the event that it is needed or we need to include the date column information in the API.

And we need multiple output columns for Cox's Regression.

db213 commented 6 years ago

So I can change the output_column to be output_columns and have more than one column easily enough. The problem here I think is making clear how the two columns are supposed to relate to each other - how does ML know what to do with the two columns (take the difference, add them together, etc)? And then such a change should be valid for every single ML algorithm, not just for performing Cox's Regression. To complicate it even more, this means it can be possible to input > 2 columns, and then there needs to be a way of specifying what to do with each value in each column. E.g. add the first two together, then subtract the third, then multiply by the fourth.

That would be the most general approach, anyway. We could keep it super specific, accept more than one input (but it must only be at most exactly 2 inputs), and then assume the operation between the two is the absolute value of the difference between the values in each column.

Thoughts on what would be best would be good. Option two is way simpler, but not as neat and kind of arbitrary.

cheemcheem commented 6 years ago

How about this idea

output_columns can be a dictionary with values and types of: {time_columns: [str], output_column: str}

Normal algorithms can be passed just the output_columns.output_column value.

Algorithms that need times can be passed the output_columns.time_columns and the output_columns.output_column values.

If there are two time columns then the output_columns.time_columns value should be an array of two column names in order preferably (so ['start time column', 'end time column'])

If there is just one time column then the output_column.time_columns value should be an array of one column name

This way Coxs Regression and others that need times in ML can determine whether it needs to process the time_values into one column depending on if there are two values in the output_column.time_columns array or not.

db213 commented 6 years ago

Seems to me like your suggestion is basically just option 2 of the two I'd outlined but slightly more complicated. I think it would be neater to just make output columns an array of column objects, so that more than one can be accepted: output_columns: [ <column_object>] (a column object is already defined in the protocol) with the understanding that no more than two should ever be passed in. If it's only one - then no problem, the one column is the one output. If it's two columns, then you know the columns contain times, and the actual output wanted is based on taking the difference between the two columns input. So for a normal ML job the field would look like this:

output_columns: [
    {
        column_index: int
        column_type: continuous
    }
]

While the input for Cox's Regression (or whatever ML algorithm, it shouldn't actually matter) that needs to take the difference of two columns is

output_columns: [
   {
        column_index: 1
        column_type: continuous
    },
    {
        column_index: 2
        column_type: continuous
    }
]

And an input with any other number of output columns in the output_columns array can throw an error.

I think this would work, and is fairly neat, but it's not very general. And if I understood your suggestion correctly, amounts to about the same result?

cheemcheem commented 6 years ago

It amounts to being very close but not exactly what we need.

Making it an array is a good idea, fair enough. However Cox's Regression may require a start date column, end date column and event column at the same time:

output_columns: [
    {
        column_index: int /
        column_type: continuous
    }
    {
        column_index: int
        column_type: continuous
    }
    {
        column_index: int
        column_type: continuous
    }
]

Cox's Regression requires:

The duration column or two date columns
The event column

If it is size three it will contain two date columns and the event column. If it size two then it will contain a duration column and the event column. If it is size one it is just an output column and cannot be used with Cox's Regression.

So I'm suggesting do what you just said but with a maximum of three columns.

db213 commented 6 years ago

Granted I've only done cursory reading about Cox's Regression, so just correct me if I'm wrong about something, but this definitely doesn't feel right...

Cox's Regression is used to predict the survival time of patients and correlates it to one or more variables (e.g. age or gender). So in the API this would imply that the target data is the time until the event (e.g. death) occurs. So output_columns should only have one or two columns (for the situation you described in the issue) to find duration. Cox's Regression does not output a boolean of whether of not the event occurred, which is what including the event column in output_columns would imply. What it outputs is a function that gives the probability of an event occurring given an amount of time. I would have thought the event column should be in input_columns instead. We can add a convention that the first column in the input_columns array for a Cox's Regression job is the event column.

Also, I've failed to find any sample data we've been supplied that has the start time and end time like you described. Do you mind sending it so I can take a look?

hewa55 commented 6 years ago

Okay so at it seems, we only need two columns which are one of the following two:

1) duration column, event column 2) start_time column, event_time column

In 2, the duration can be deduced from the input and that an event has happened can be deduced from the event_time column. However for case 1, it needs to be ensured that duration column refers to the event, but I think that can be left to the user.

This could be handled through the Datatypes enums in the swagger and the output columns can be made a list with either 1 or 2 elements contained.

Does this suggestion make sense?

db213 commented 6 years ago

Makes sense but I would still argue we're putting the event column in the wrong field.

For case 1, I don't think the event column should be in output_columns field. It ought to be part of the input, with the other inputs. Cox's regression is predicting the duration, not whether or not the event happens. And yeah I agree it's the user's responsibility to select the relevant duration column.

For case 2, it could be deduced if the event happened or not, but doesn't change the fact that the user might still want to pass in an event column (if it exists). If they do, it should be passed in input_columns again. But like you said, it shouldn't be necessary to pass in at all as it can be deduced. It's hard to speculate as I haven't seen any data with this format. (e.g. how do we know what event_time values look like if the event hasn't happened? Is it always going to be the same so we always know what to look for?)

I think output_columns ought to be a list of <= 2 elements, where 2 elements is only used for case 2. And the event column ought to be passed into input_columns. Passing it through output_columns breaks the representation of what that field is supposed to mean.

db213 commented 6 years ago

Have made output_column into output_columns which takes an array of column objects, instead of just one. Convention has been defined that event column is passed in the input_columns as the first object in that array.

vv20 commented 6 years ago

Ok so in survival analysis there is something called "censorship"- whether the event was observed or not. It can have quite a distinct effect on the resulting survival/hazard function so completely ignoring it would be very wrong and our client would probably not be in love with that idea. Censorship can also not necessarily be deduced from the end date being in the data. For example, if we're talking about predicting the survival time of cancer patients, even if a death has occurred (so there IS an end date), it could be from a source unrelated to the study (i.e. not cancer). Therefore the subject is sensored, but according to this API model, the ML server has to treat it as uncensored. What is required is another possible configuration of the output_columns field, in which there is a boolean "censorship" column, in addition to either one continuous column (durations) or two continuous columns (start time and end time).

cheemcheem commented 6 years ago

Ohhh baby I like the way you think

db213 commented 6 years ago

Touching on this subject again: the censorship of a result is not part of the output of Cox's Regression and should not be part of the output_columns field. How about a solution like this (which also does not alter the API at all so doesn't make the call increasingly customised to Cox's Regression when it's supposed to be generic):

Both censorship and event columns are part of the input_columns, in any order the user likes. However, the parameters of Cox's Regression can include two required integer fields event_column_index and censorship_column_index which specify at what index in the array of input_columns those two columns are located (since in JSON, arrays maintain order).

This solution requires no changes to the API and I don't see any reason why it shouldn't work.

vv20 commented 6 years ago

"Event" and "censorship" are the same thing, but that's a minor point. Why pass an index of the column if you could just pass the column itself? Seems a bit backwards, especially because you would not want to include the censorship/event column in the covariates that you are training Cox regression on, so why would you shove everything in the same pot?

db213 commented 6 years ago

There is no problem with passing the column, just pass it in like you pass in all the other columns - the problem is with identifying it and separating it as it's own distinct thing (from all the other columns you've passed in, be it if you chose to pass it with output_columns or input_columns, it doesn't matter, you still need to distinguish it somehow). My suggestion is to just put a parameter into the Cox's Regression job which specifies which column is the event column. (If I understand correctly) Your suggestion is to add the event column to output_columns, possibly with some boolean attribute to identify it as the event column? They amount to exactly the same thing (identifying the event column) but one requires a protocol change and the other is already supported in the protocol.

You say that the event column should not be in the same "pot" as the input columns, but equally it should not be put in the same pot as the output columns. It is neither, but I think it fits better in input, since it is part of what the job needs to consider with the input, and is not something that the job outputs.

Edit: I just wanted to note two things: I had already discussed a solution to this issue - adding a convention that the first column in input_columns for a Cox's Regression job was the event column. That is still a perfectly good solution, I think, but I also thought of the parameter idea, which could also work.

Secondly, I would note that just adding the event column to output_columns causes more problems than it sorts. You will no longer immediately know upon receiving the call which two of the three columns in output_columns are the start/end columns. Neither will you know if you've received an event column or not in a request with two columns (is it a start and end times column, or is it a duration and event column?). Basically, you still end up with the exact same problem of identifying which column in output_columns is the event column. And again, you solve this issue by either using convention or a parameter.

So unless you're objecting to the way the column is identified (i.e. by convention or parameter), then I believe your issue is solely with whether it should be passed in through input_columns or output_columns? Please correct me if I'm misunderstanding you.

cheemcheem commented 6 years ago

Also another point: if cox's regression is passed date/time columns instead of a duration column then it needs to know what the date/time format it was given is. Unless we want to explicitly say all dates/times must be in a certain format.

db213 commented 6 years ago

Yeah, I thought of that issue when we first discussing this. Honestly, if we're going to put restraints on the data anyway, is it such a bad idea to just say "Cox's Regression only works with a duration column" and just not support the case where we take the difference between two columns? I was not at the meeting with Peter, but it sounded like he was not greatly invested in that working and it's much more complicated than it appears at first or is worth.

Also are we okay with the idea of the event column being in input_columns and identified by a parameter, or is this still an issue?

CS3099JH2017 / cs3099jh

Problem with handling Date columns and multiple columns for Cox's Regression #28