clinicalml / TabLLM

MIT License

The table_to_text function returns an empty list #1

Closed mortezamg63 closed 1 year ago

mortezamg63 commented 1 year ago

Hi

I am trying to run the code in the create_external_datasets.py file for the income dataset using the t0serialization or tabletotext debug arguments. However, the table_to_text function contains a regular expression that returns an empty list (around line 180 in the file). I think the entry_to_text function has the same problem.

def table_to_text(note):
    re_name_value = re.compile(r"^- (.*):([^:]*)$", re.MULTILINE)
    name_values = re_name_value.findall(note)                                # --->  name_values = []
    examples = [write_into_table(x[0].strip(), x[1].strip()) for x in name_values]
    return [preprocess(e)['linearized_table'] for e in examples]

I cannot figure out how to get the code to run. Could you please let me know whether the problem is with the template or something else? Thanks

stefanhgm commented 1 year ago

Thanks for your interest in using TabLLM! Unfortunately, I currently don’t have access to my computer. I will come back to you on the 11th of April. Sorry for the delay.

Best, Stefan

mortezamg63 commented 1 year ago

It is OK. I look forward to hearing from you.

Thanks

seygodin commented 1 year ago

Error: "cannot import name 'DatasetDict' from 'datasets' (unknown location)" Hello, I'm also trying to use TabLLM, and I encountered the error above, which you may have experienced as well. I can't find that kind of folder. Could you share how you solved this issue?

stefanhgm commented 1 year ago

Hi @mortezamg63,

the error occurs because the dataset must first be internally serialized as a dashed list before the tabletotext or T0 model is applied for serialization; this can be seen as a preparation step. Hence, the regex expects a dashed list. Sorry for not making this clear.
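To illustrate, here is a minimal sketch using the regex from table_to_text (the sample note strings are made up for demonstration): the pattern matches dashed-list input line by line, but finds nothing in a raw dict string, which is why name_values comes back empty without --list.

```python
import re

# The regex from table_to_text expects each feature on its own "- name: value" line.
re_name_value = re.compile(r"^- (.*):([^:]*)$", re.MULTILINE)

# A dict-style note (what you get without --list) matches nothing:
dict_note = "{'age': '39', 'workclass': 'Private'}"
print(re_name_value.findall(dict_note))  # []

# A dashed-list note (produced with --list) yields one (name, value) pair per line:
list_note = "- age: 39\n- workclass: Private"
print(re_name_value.findall(list_note))  # [('age', ' 39'), ('workclass', ' Private')]
```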

The problem can be solved by adding list serialization on the command line. The following command works for me to perform the tabletotext serialization:

create_external_datasets.py --dataset income --list --tabletotext

@seygodin your problem seems to be an import issue. Please check whether you installed all required dependencies. If the problem persists, please open a separate issue.

Hope that helps!

mortezamg63 commented 1 year ago

@stefanhgm Thanks for your answer. I ran the code with the command you provided, but I still cannot figure out the issue in the code; it is not working for tabletotext. The variable notes gives me a list of dictionaries like the output below.

notes[0]
"{'age': '39', 'workclass': 'owner of a non-incorporated business, professional practice, or farm', 'education': 'finished 11th class', 'marital_status': 'married', 'occupation': 'transportation, communication, and other public utilities sector', 'relationship': 'Husband', 'race': 'White', 'sex': 'Male', 'capital_gain': '0', 'capital_loss': '0', 'hours_per_week': '40', 'native_country': 'United States', 'label': 'False'}"

Lines 186 to 189, shown below, are the part where I cannot see any serialization applied to notes.

186:  old_size = len(notes)
187:  notes = Dataset.from_dict({'text': list(itertools.chain(*[table_to_text(n) for n in notes]))})
188:  assert notes.shape[0] == num_features * old_size
189:  notes = notes.map(serialize)

What I understand from the code above is that serialize in line 189 is called after the table_to_text function in line 187. In addition, as mentioned and shown above, the variable notes is not a dashed list. Can you show me what the dashed list should look like? I can then generate that format myself.

Thanks for taking the time to help me

stefanhgm commented 1 year ago

Hi @mortezamg63

Sorry for the late reply! To double-check, I cloned the repo as is, only adjusted data_dir and output_dir, and the command below runs as expected:

create_external_datasets.py --dataset income --list --tabletotext

Re serialization: depending on the LLM used for serialization (Narrativaai/bloom-560m-finetuned-totto-table-to-text, bigscience/T0, GPT-3), a different format is required. Narrativaai/bloom-560m-finetuned-totto-table-to-text requires an HTML table with a highlighted cell (see the model's documentation). For this, we actually create a single table with a single cell for each value and use the model to serialize it into a sentence, because otherwise we found that the model leaves out information. The serialization, i.e. the dict in the notes dataset, should look like this (for income):

{
'text': ['<s><table> <row> <highlighted_cell> Age </highlighted_cell> </row> <row> <highlighted_cell> 39 <col_header> Age </col_header> </highlighted_cell> </row> </table>\n\n', '<s><table> <row> <highlighted_cell> Race </highlighted_cell> </row> <row> <highlighted_cell> White <col_header> Race </col_header> </highlighted_cell> </row> </table>\n\n', [...]
}
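For clarity, here is a small sketch of how one such single-cell table string could be built per feature (the helper name is mine, not the repo's; the format is copied from the example above):

```python
def totto_single_cell(name, value):
    """Build a one-cell ToTTo-style table string for a single feature,
    matching the format shown in the example above."""
    return (
        "<s><table> <row> <highlighted_cell> "
        f"{name} </highlighted_cell> </row> <row> <highlighted_cell> "
        f"{value} <col_header> {name} </col_header> </highlighted_cell> "
        "</row> </table>\n\n"
    )

print(totto_single_cell("Age", "39"))
```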

For bigscience/T0 (using --t0serialization) we created a dashed list with two items and ask the LLM to serialize it into a sentence. We found that this leads to the least information loss. So the dict would look like this:

{
'text': ['Write this information as a sentence: Age: 39, Race: White. \n', 'Write this information as a sentence: Sex: Male, Marital status: married. \n', [...]
}
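A hedged sketch of how such two-item prompts could be assembled from feature pairs (the pairing logic and variable names are my assumption; only the prompt format comes from the example above):

```python
# Pair up features two at a time and build a T0 serialization prompt for each pair.
features = [("Age", "39"), ("Race", "White"), ("Sex", "Male"), ("Marital status", "married")]

prompts = []
for i in range(0, len(features), 2):
    pair = features[i:i + 2]
    body = ", ".join(f"{name}: {value}" for name, value in pair)
    prompts.append(f"Write this information as a sentence: {body}. \n")

print(prompts[0])  # "Write this information as a sentence: Age: 39, Race: White. \n"
```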

For GPT-3 we simply serialized the dataset as a dashed list (just as for T0, but with all features in the list) and used an external program to query the OpenAI API.
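As a rough sketch (the helper name is illustrative, not the repo's code), turning one of the stringified dict notes shown earlier in the thread into the all-features dashed-list form could look like:

```python
import ast

# A note as it appears in the notes variable: a stringified feature dict.
note = "{'age': '39', 'race': 'White', 'sex': 'Male'}"

def to_dashed_list(note_str):
    """Convert a stringified feature dict into a dashed list,
    one '- name: value' line per feature."""
    features = ast.literal_eval(note_str)
    return "\n".join(f"- {name}: {value}" for name, value in features.items())

print(to_dashed_list(note))
# - age: 39
# - race: White
# - sex: Male
```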

Hope that helps!

stefanhgm commented 1 year ago

I am closing this issue since it seems resolved.