Closed: mortezamg63 closed this issue 1 year ago
Thanks for your interest in using TabLLM! Unfortunately, I currently don’t have access to my computer. I will come back to you on the 11th of April. Sorry for the delay.
Best, Stefan
It is OK. I look forward to hearing from you.
Thanks
Error: "cannot import name 'DatasetDict' from 'datasets' (unknown location)" Hello, I'm also trying to use TabLLM. By the way, I encountered the error you may have experienced. I can't find that kind of folder. Can I ask how you solved this issue?
Hi @mortezamg63,
The error is due to the fact that datasets must be internally serialized as a list before applying the tabletotext or T0 model for serialization; it can be seen as a preparation step. Hence, the regex expects a dashed list. Sorry for not making this clear.
The problem can be solved by adding the list serialization to the command line. The following command works for me to perform the tabletotext serialization:

create_external_datasets.py --dataset income --list --tabletotext
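For illustration, the dashed-list form of one income row could be built like this (a minimal sketch with a hypothetical helper name; the actual formatting in the repo may differ):

```python
def to_dashed_list(row: dict) -> str:
    """Serialize one feature dict as a dashed list: one '- key: value' line per feature."""
    # Skip the label, since only the features are serialized for the LLM.
    return "\n".join(f"- {key}: {value}" for key, value in row.items() if key != "label")

row = {"age": "39", "sex": "Male", "hours_per_week": "40", "label": "False"}
print(to_dashed_list(row))
# - age: 39
# - sex: Male
# - hours_per_week: 40
```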
@seygodin your problem seems to be an import issue. Please check that you installed all required dependencies. If you still encounter the problem, please open a separate issue.
Hope that helps!
@stefanhgm Thanks for your answer. I ran the code with the command you provided in your answer, but I still cannot figure out the issue: it is not working for tabletotext. The variable notes gives me a list of dictionaries like the output below.
notes[0]
"{'age': '39', 'workclass': 'owner of a non-incorporated business, professional practice, or farm', 'education': 'finished 11th class', 'marital_status': 'married', 'occupation': 'transportation, communication, and other public utilities sector', 'relationship': 'Husband', 'race': 'White', 'sex': 'Male', 'capital_gain': '0', 'capital_loss': '0', 'hours_per_week': '40', 'native_country': 'United States', 'label': 'False'}"
Lines 186 to 189 show the part where I cannot see any serialization on notes. The lines are shown below as well.
186: old_size = len(notes)
187: notes = Dataset.from_dict({'text': list(itertools.chain(*[table_to_text(n) for n in notes]))})
188: assert notes.shape[0] == num_features * old_size
189: notes = notes.map(serialize)
What I understand from the code above is that the serialization in line 189 is called after the table_to_text function in line 187. In addition, as mentioned and shown above, the variable notes is not a dashed list. Can you show me what the dashed list should look like? I can then generate that format myself.
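To illustrate the shape invariant asserted in lines 186 to 189, here is a toy version of that pipeline with a stub table_to_text that just emits one string per feature (the real function prepares prompts for an LLM, and the real notes are wrapped in a HuggingFace Dataset):

```python
import itertools

def table_to_text(note: dict) -> list:
    # Stub: one prompt string per feature of the row.
    return [f"{k}: {v}" for k, v in note.items()]

notes = [{"age": "39", "sex": "Male"}, {"age": "50", "sex": "Female"}]
num_features = 2
old_size = len(notes)

# Flatten: each row expands into num_features strings, like line 187.
flat = list(itertools.chain(*[table_to_text(n) for n in notes]))

# The invariant from line 188: one string per (row, feature) pair.
assert len(flat) == num_features * old_size
```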
Thanks for taking the time.
Hi @mortezamg63,
Sorry for the late reply! Again, I cloned the repo as is, only adjusted data_dir and output_dir, and the command below runs as expected:

create_external_datasets.py --dataset income --list --tabletotext
Re serialization: Depending on the LLM used for serialization (Narrativaai/bloom-560m-finetuned-totto-table-to-text, bigscience/T0, GPT-3), a different format is required. For Narrativaai/bloom-560m-finetuned-totto-table-to-text an HTML table with a highlighted cell is required (see the model's documentation). For this, we actually created a single table with a single cell for each value and used the model to serialize it into a sentence, because otherwise we experienced that the model leaves out information. The serialization, i.e. the dict in the notes dataset, should look like this (for income):
{
'text': ['<s><table> <row> <highlighted_cell> Age </highlighted_cell> </row> <row> <highlighted_cell> 39 <col_header> Age </col_header> </highlighted_cell> </row> </table>\n\n', '<s><table> <row> <highlighted_cell> Race </highlighted_cell> </row> <row> <highlighted_cell> White <col_header> Race </col_header> </highlighted_cell> </row> </table>\n\n', [...]
}
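Those single-cell tables could be generated with something like the following sketch (hypothetical helper name; the markup is copied from the example above):

```python
def to_highlighted_cell(header: str, value: str) -> str:
    """Build a one-cell ToTTo-style table with the value highlighted under its header."""
    return (
        f"<s><table> <row> <highlighted_cell> {header} </highlighted_cell> </row> "
        f"<row> <highlighted_cell> {value} <col_header> {header} </col_header> "
        f"</highlighted_cell> </row> </table>\n\n"
    )

print(to_highlighted_cell("Age", "39"))
```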
For bigscience/T0 (using --t0serialization) we created a dashed list with two items and asked the LLM to serialize it into a sentence. We found that this leads to the least information loss. So the dict would look like this:
{
'text': ['Write this information as a sentence: Age: 39, Race: White. \n', 'Write this information as a sentence: Sex: Male, Marital status: married. \n', [...]
}
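A sketch of building those two-item prompts (hypothetical helper name; the pairing and wording follow the example above):

```python
def to_t0_prompts(row: dict) -> list:
    """Group features into pairs and build one serialization prompt per pair."""
    items = list(row.items())
    prompts = []
    for i in range(0, len(items), 2):
        pair = items[i:i + 2]  # two features per prompt
        body = ", ".join(f"{k}: {v}" for k, v in pair)
        prompts.append(f"Write this information as a sentence: {body}. \n")
    return prompts

row = {"Age": "39", "Race": "White", "Sex": "Male", "Marital status": "married"}
for p in to_t0_prompts(row):
    print(repr(p))
```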
For GPT-3 we simply serialized the dataset as a dashed list (just as for T0, but with all features in the list) and used an external program to query the OpenAI API.
Hope that helps!
I'm closing this issue since it seems resolved.
Hi
I am trying to run the code in the create_external_datasets.py file for the income dataset using t0serialization or tabletotext with debug arguments. But the function table_to_text in it has a regular expression that returns an empty list (around line 180 in the file). I think there is the same problem in the entry_to_text function.
I cannot find the cause of the issue. Could you please let me know whether the problem is with the template or something else? Thanks