Open limhasic opened 6 months ago
@limhasic Hello! This repo is no longer maintained. However, I'd like to understand the issues you've raised so that I can try to address them. :)
First of all, thank you for your reply.
I was following this notebook: Relational_REaLTabFormer_Experiments.ipynb. You wrote pd.read_csv(train_users_2.csv.zip), but I think you meant to unzip the file first.
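A side note on the zipped file: pandas can usually read a zip archive directly when it holds a single CSV, because the `compression` argument of `pd.read_csv` defaults to `"infer"` and is chosen from the file extension. The sketch below is only an illustration: it builds a tiny hypothetical archive (`demo_users.csv.zip`) on the fly instead of using the real `train_users_2.csv.zip`, so whether this applies to the notebook's file depends on what that archive contains.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd


def write_demo_zip(directory: Path) -> Path:
    """Create a tiny zipped CSV (hypothetical stand-in for train_users_2.csv.zip)."""
    zip_path = directory / "demo_users.csv.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("demo_users.csv", "id,age\nu1,30\nu2,41\n")
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    zip_path = write_demo_zip(Path(tmp))
    # compression="infer" is the default; it is spelled out here for clarity.
    demo_users = pd.read_csv(zip_path, compression="infer")

print(demo_users)
```

If the archive held more than one file, reading it this way would fail and unzipping first would be the simpler route.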
air_out_df = a_sessions[a_sessions["user_id"].isin(air_in_df["user_id"].tolist())]
The line above raised an error, so I checked and found that my data was different from yours.
While comparing HMA and REaLTabFormer, I generated synthetic data based on the hotel data. I know I used too little data, so I will also test with other datasets. However, from what is written in the paper I expected missing values to be learned as well, and that did not seem to be the case. I have some doubts about this.
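One way to make the missing-value question concrete is to compare the share of missing values per column in the real table and in the generated one. This is a minimal sketch using pandas only; the function name `missing_rate_report` and the toy frames are made up for illustration and are not part of REaLTabFormer.

```python
import numpy as np
import pandas as pd


def missing_rate_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Per-column share of missing values in the real and the synthetic table."""
    shared = [c for c in real.columns if c in synthetic.columns]
    report = pd.DataFrame({
        "real_missing": real[shared].isna().mean(),
        "synthetic_missing": synthetic[shared].isna().mean(),
    })
    report["difference"] = report["synthetic_missing"] - report["real_missing"]
    return report


# Toy data (hypothetical): the synthetic "rate" column has lost its missing values.
real_demo = pd.DataFrame({
    "rate": [100.0, np.nan, 120.0, np.nan],
    "city": ["A", "B", None, "C"],
})
synth_demo = pd.DataFrame({
    "rate": [101.0, 99.0, 118.0, 125.0],
    "city": ["A", None, "B", "C"],
})

report = missing_rate_report(real_demo, synth_demo)
print(report)
```

A large negative value in `difference` would suggest that missingness in that column is not being reproduced, although with very small tables the rates are too noisy to conclude much.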
[in code]
pip install realtabformer

import os
import pandas as pd
from pathlib import Path
from realtabformer import REaLTabFormer

parent_df = real_data['hotels']
child_df = real_data['guests']
join_on = "hotel_id"

# Make sure that the key columns in both the parent and the child table have the same name.
assert ((join_on in parent_df.columns) and (join_on in child_df.columns))

# Non-relational or parent table. Don't include the unique_id field.
parent_model = REaLTabFormer(model_type="tabular", epochs=1)
parent_model.fit(parent_df.drop(join_on, axis=1))

pdir = Path("rtf_parent/")
parent_model.save(pdir)

parent_model_path = sorted([p for p in pdir.glob("id*") if p.is_dir()], key=os.path.getmtime)[-1]

child_model = REaLTabFormer(
    model_type="relational",
    parent_realtabformer_path=parent_model_path,
    output_max_length=1024,
    train_size=0.8)

child_model.fit(df=child_df, in_df=parent_df, join_on=join_on)

parent_samples = parent_model.sample(len(parent_df))

# Create the unique ids based on the index.
parent_samples.index.name = join_on
parent_samples = parent_samples.reset_index()

child_samples = child_model.sample(
    input_unique_ids=parent_samples[join_on],
    input_df=parent_samples.drop(join_on, axis=1),
    gen_batch=64)
[error occurred]
qt_interval will be set to qt_interval_unique=100.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:597: UserWarning: qt_interval adjusted from 100 to 2...
warnings.warn(
Bootstrap round: 6% 32/500 [00:04<01:17, 6.05it/s]
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
r = call_item()
File "/usr/local/lib/python3.8/dist-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
return self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 589, in __call__
return [func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 589, in
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Cell In[50], line 24
21 # Non-relational or parent table. Don't include the
22 # unique_id field.
23 parent_model = REaLTabFormer(model_type="tabular", epochs = 1)
---> 24 parent_model.fit(parent_df.drop(join_on, axis=1))
26 pdir = Path("rtf_parent/")
27 parent_model.save(pdir)
File /usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:458, in REaLTabFormer.fit(self, df, in_df, join_on, resume_from_checkpoint, device, num_bootstrap, frac, frac_max_data, qt_max, qt_max_default, qt_interval, qt_interval_unique, distance, quantile, n_critic, n_critic_stop, gen_rounds, sensitivity_max_col_nums, use_ks, full_sensitivity, sensitivity_orig_frac_multiple, orig_samples_rounds, load_from_best_mean_sensitivity, target_col)
456 trainer.train(resume_from_checkpoint=resume_from_checkpoint)
457 else:
--> 458 trainer = self._train_with_sensitivity(
459 df,
460 device,
461 num_bootstrap=num_bootstrap,
462 frac=frac,
463 frac_max_data=frac_max_data,
464 qt_max=qt_max,
465 qt_max_default=qt_max_default,
466 qt_interval=qt_interval,
467 qt_interval_unique=qt_interval_unique,
468 distance=distance,
469 quantile=quantile,
470 n_critic=n_critic,
471 n_critic_stop=n_critic_stop,
472 gen_rounds=gen_rounds,
473 resume_from_checkpoint=resume_from_checkpoint,
474 sensitivity_max_col_nums=sensitivity_max_col_nums,
475 use_ks=use_ks,
476 full_sensitivity=full_sensitivity,
477 sensitivity_orig_frac_multiple=sensitivity_orig_frac_multiple,
478 orig_samples_rounds=orig_samples_rounds,
479 load_from_best_mean_sensitivity=load_from_best_mean_sensitivity,
480 )
482 del self.dataset
484 elif self.model_type == ModelType.relational:
File /usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:607, in REaLTabFormer._train_with_sensitivity(self, df, device, num_bootstrap, frac, frac_max_data, qt_max, qt_max_default, qt_interval, qt_interval_unique, distance, quantile, n_critic, n_critic_stop, gen_rounds, sensitivity_max_col_nums, use_ks, resume_from_checkpoint, full_sensitivity, sensitivity_orig_frac_multiple, orig_samples_rounds, load_from_best_mean_sensitivity)
600 qt_interval = _qt_interval
602 # Computing this here before splitting may have some data
603 # leakage issue, but it should be almost negligible. Doing
604 # the computation of the threshold on the full data with the
605 # train size aligned will give a more reliable estimate of
606 # the sensitivity threshold.
--> 607 sensitivity_values = SyntheticDataBench.compute_sensitivity_threshold(
608 train_data=df,
609 num_bootstrap=num_bootstrap,
610 # Divide by two so that the train_data in this computation matches the size
611 # of the final df used to train the model. This is essential so that the
612 # sensitivity_threshold value is consistent with the val_sensitivity.
613 # Concretely, the computation of the distribution of min distances is
614 # relative to the number of training observations.
615 # The frac in this method corresponds to the size of both the test and the
616 # synthetic samples.
617 frac=frac / 2,
618 qt_max=qt_max,
619 qt_interval=qt_interval,
620 distance=distance,
621 return_values=True,
622 quantile=quantile,
623 max_col_nums=sensitivity_max_col_nums,
624 use_ks=use_ks,
625 full_sensitivity=full_sensitivity,
626 sensitivity_orig_frac_multiple=sensitivity_orig_frac_multiple,
627 )
628 sensitivity_threshold = np.quantile(sensitivity_values, quantile)
629 mean_sensitivity_value = np.mean(sensitivity_values)
File /usr/local/lib/python3.8/dist-packages/realtabformer/rtf_analyze.py:718, in SyntheticDataBench.compute_sensitivity_threshold(train_data, num_bootstrap, test_size, frac, qt_max, qt_interval, distance, tsvd, return_values, quantile, max_col_nums, use_ks, full_sensitivity, sensitivity_orig_frac_multiple)
716 print("Using parallel computation!!!")
717 with joblib.Parallel(n_jobs=n_jobs) as parallel:
--> 718 values = parallel(
719 joblib.delayed(bootstrap_innerloop)()
720 for _ in tqdm(range(num_bootstrap), desc="Bootstrap round")
721 )
723 print("Sensitivity threshold summary:")
724 print(pd.Series(values).describe())
File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1952, in Parallel.__call__(self, iterable)
1946 # The first item from the output is blank, but it makes the interpreter
1947 # progress until it enters the Try/Except block of the generator and
1948 # reach the first yield statement. This starts the aynchronous
1949 # dispatch of the tasks to the workers.
1950 next(output)
-> 1952 return output if self.return_generator else list(output)
File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1595, in Parallel._get_outputs(self, iterator, pre_dispatch)
1592 yield
1594 with self._backend.retrieval_context():
-> 1595 yield from self._retrieve()
1597 except GeneratorExit:
1598 # The generator has been garbage collected before being fully
1599 # consumed. This aborts the remaining tasks if possible and warn
1600 # the user if necessary.
1601 self._exception = True
File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1699, in Parallel._retrieve(self)
1692 while self._wait_retrieval():
1693
1694 # If the callback thread of a worker has signaled that its task
1695 # triggered an exception, or if the retrieval loop has raised an
1696 # exception (e.g. GeneratorExit), exit the loop and surface the
1697 # worker traceback.
1698 if self._aborting:
-> 1699 self._raise_error_fast()
1700 break
1702 # If the next job is not ready for retrieval yet, we just wait for
1703 # async callbacks to progress.
File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1734, in Parallel._raise_error_fast(self)
1730 # If this error job exists, immediatly raise the error by
1731 # calling get_result. This job might not exists if abort has been
1732 # called directly or if the generator is gc'ed.
1733 if error_job is not None:
-> 1734 error_job.get_result(self.timeout)
File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:736, in BatchCompletionCallBack.get_result(self, timeout)
730 backend = self.parallel._backend
732 if backend.supports_retrieve_callback:
733 # We assume that the result has already been retrieved by the
734 # callback thread, and is stored internally. It's just waiting to
735 # be returned.
--> 736 return self._return_or_raise()
738 # For other backends, the main thread needs to run the retrieval step.
739 try:
File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:754, in BatchCompletionCallBack._return_or_raise(self)
752 try:
753 if self.status == TASK_ERROR:
--> 754 raise self._result
755 return self._result
756 finally:
ValueError: Input contains NaN.
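Judging from the traceback, the error surfaces inside the bootstrap step that computes the sensitivity threshold during `parent_model.fit(...)`, so it may help to check which columns of the parent table contain NaN before fitting. The sketch below uses pandas only; `nan_summary` and `toy_parent` are hypothetical names standing in for the real `parent_df`, and whether to drop, impute, or keep the missing values depends on the data.

```python
import numpy as np
import pandas as pd


def nan_summary(df: pd.DataFrame) -> pd.DataFrame:
    """List the columns that contain missing values, most affected first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "dtype": df.dtypes.astype(str),
    })
    affected = summary[summary["n_missing"] > 0]
    return affected.sort_values("n_missing", ascending=False)


# Toy parent table (hypothetical) with gaps in two columns.
toy_parent = pd.DataFrame({
    "hotel_id": ["H1", "H2", "H3"],
    "city": ["Boston", "Austin", "Denver"],
    "rating": [4.5, np.nan, 3.0],
    "classification": ["budget", None, None],
})

summary = nan_summary(toy_parent)
print(summary)
```

Running the same check on `parent_df.drop(join_on, axis=1)` would show exactly what is passed to `fit`.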
Also, I'm really thankful that, although many commercial models came out after HMA, you are the only one who released yours as open source with an explanation of the algorithm. Thank you.
Following in your footsteps, I am trying to understand and use REaLTabFormer.
But there are many things I don't understand.
For example, a_users[a_users["id"].isin(users_ids.index)] returns many rows in your notebook output, but it returns nothing with my data.
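When an `isin` filter like this comes back empty, a common cause is that the two sets of keys do not actually overlap, for instance because one side holds strings and the other integers, or because the index is a default `RangeIndex` instead of the ids. This is only a guess at the cause here. The sketch below shows a small diagnostic; `key_overlap` and the demo objects are made-up names, not part of the notebook.

```python
import pandas as pd


def key_overlap(left_keys, right_keys) -> dict:
    """Report dtypes and match counts between two collections of keys."""
    left = pd.Series(list(left_keys))
    right = pd.Series(list(right_keys))
    return {
        "left_dtype": str(left.dtype),
        "right_dtype": str(right.dtype),
        # Matches when the values are compared as they are.
        "n_matches": int(left.isin(right).sum()),
        # Matches after casting both sides to strings.
        "n_matches_as_str": int(left.astype(str).isin(right.astype(str)).sum()),
    }


# Toy data (hypothetical): ids are strings in the table but integers in the index.
a_users_demo = pd.DataFrame({"id": ["1", "2", "3"]})
users_ids_demo = pd.Series([10, 20], index=[1, 2])

info = key_overlap(a_users_demo["id"], users_ids_demo.index)
print(info)
```

If `n_matches` is zero while `n_matches_as_str` is not, the keys probably differ only in type; if both are zero, the two tables likely come from different data.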