limhasic commented 6 months ago

I am following your work to understand and use REaLTabFormer.

But there are many things I don't understand.

The numbers in your notebook do not match the Airbnb data on Kaggle, so I am not sure where your Airbnb dataset came from.

For example, a_users[a_users["id"].isin(users_ids.index)] returns many rows in your notebook output, but it returns nothing with my data.

There is also an error about missing values. The paper says that missing data is learned as well, but in practice the code raises an error about missing values.

avsolatorio commented 6 months ago

@limhasic Hello! This repo is no longer maintained. However, I'd like to understand the issues you've raised so that I can try to address them. :)

Could you please share the link to the Kaggle data you refer to?
Can you share more details about the error you are getting?

limhasic commented 6 months ago

First of all, thank you for your reply.

Here is the Kaggle link: https://www.kaggle.com/competitions/airbnb-recruiting-new-user-bookings/overview. I chose it because sessions.csv.zip and train_users_2.csv.zip were available there.

The part I followed is Relational_REaLTabFormer_Experiments.ipynb. You wrote pd.read_csv(train_users_2.csv.zip), but I think you meant to unzip it first.
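As a side note on the zip question: pandas infers compression from the file extension by default, so a zip archive that contains a single CSV can usually be read without unzipping. The sketch below builds a tiny stand-in archive to show this; the file name and columns are made up for illustration and are not the Kaggle files.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Build a tiny stand-in archive (hypothetical name and columns).
# The real file in the thread would be train_users_2.csv.zip.
tmp_dir = Path(tempfile.mkdtemp())
zip_path = tmp_dir / "users_demo.csv.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("users_demo.csv", "id,age\nu1,31\nu2,\n")

# compression="infer" is the default, so the ".zip" suffix is enough.
# This only works when the archive holds exactly one data file.
users = pd.read_csv(zip_path)
print(users.shape)
```

If the archive held more than one file, it would have to be extracted first (for example with zipfile) and each CSV read separately.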

air_out_df = a_sessions[a_sessions["user_id"].isin(air_in_df["user_id"].tolist())]

An error occurred in this part, so I checked and found out that the data was different.
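One way to confirm that the two tables really do not share keys is to count the overlap before filtering. This is a minimal sketch on toy data; the helper name id_overlap and the toy frames are hypothetical, and only the column names id and user_id follow the thread.

```python
import pandas as pd


def id_overlap(parent_ids: pd.Series, child_ids: pd.Series) -> dict:
    """Summarize how many distinct keys two key columns have in common."""
    parent_set = set(parent_ids.dropna())
    child_set = set(child_ids.dropna())
    return {
        "parent_unique": len(parent_set),
        "child_unique": len(child_set),
        "shared": len(parent_set & child_set),
        # Differing dtypes (e.g. int vs. string) are a common cause of zero overlap.
        "parent_dtype": str(parent_ids.dtype),
        "child_dtype": str(child_ids.dtype),
    }


# Toy stand-ins for the users and sessions tables.
users = pd.DataFrame({"id": ["a1", "b2", "c3"]})
sessions = pd.DataFrame({"user_id": ["b2", "b2", "z9", None]})

report = id_overlap(users["id"], sessions["user_id"])
print(report)

# Same filter pattern as in the notebook line above.
filtered = sessions[sessions["user_id"].isin(users["id"])]
print(len(filtered))
```

A shared count of zero on the real files would suggest that the two CSVs come from different sources or that the keys are stored in different formats.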

This is the part about missing values (https://colab.research.google.com/drive/1L6i-JhJK9ROG-KFcyzT9G-8FC3L8y8Lc?usp=sharing)

I generated synthetic data from the hotel dataset while comparing HMA and REaLTabFormer. I know I used too little data, so I will test with other datasets as well. However, based on what is written in the paper, I expected missing values to be learned too, and that did not seem to be the case. I have some doubts about this.

[in code]

# Generate synthetic data the REaLTabFormer way

# Install first: pip install realtabformer
import os
import pandas as pd
from pathlib import Path
from realtabformer import REaLTabFormer

# real_data is the dict of tables loaded earlier in the notebook.
parent_df = real_data['hotels']
parent_df = parent_df.fillna(0)

child_df = real_data['guests']
child_df = child_df.fillna(0)

join_on = "hotel_id"

# Make sure that the key columns in both the
# parent and the child table have the same name.
assert ((join_on in parent_df.columns) and
        (join_on in child_df.columns))

# Non-relational or parent table. Don't include the
# unique_id field.
parent_model = REaLTabFormer(model_type="tabular", epochs = 1)
parent_model.fit(parent_df.drop(join_on, axis=1))

pdir = Path("rtf_parent/")
parent_model.save(pdir)

# Pick the most recently modified saved-model directory (id*).
parent_model_path = sorted([
    p for p in pdir.glob("id*") if p.is_dir()],
    key=os.path.getmtime)[-1]

child_model = REaLTabFormer(
    model_type="relational",
    parent_realtabformer_path=parent_model_path,
    output_max_length=1024,
    train_size=0.8)

child_model.fit(
    df=child_df,
    in_df=parent_df,
    join_on=join_on)

# Generate parent samples.
parent_samples = parent_model.sample(len(parent_df))

# Create the unique ids based on the index.
parent_samples.index.name = join_on
parent_samples = parent_samples.reset_index()

# Generate the relational observations.
child_samples = child_model.sample(
    input_unique_ids=parent_samples[join_on],
    input_df=parent_samples.drop(join_on, axis=1),
    gen_batch=64)
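A note on the fillna(0) calls above: filling with 0 makes a missing number indistinguishable from a real zero, so whatever is trained on the filled table cannot learn which values were missing. One common workaround, sketched here on toy data, is to keep an explicit indicator column next to each filled column. The helper name fill_with_indicator, the _was_missing suffix, and the toy column amenities_fee are hypothetical, and this is not something the REaLTabFormer paper or code prescribes.

```python
import numpy as np
import pandas as pd


def fill_with_indicator(df: pd.DataFrame, fill_value: float = 0) -> pd.DataFrame:
    """Fill missing numeric values but remember where they were."""
    out = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    for col in numeric_cols:
        if df[col].isna().any():
            # Boolean flag that preserves the missingness pattern.
            out[f"{col}_was_missing"] = df[col].isna()
            out[col] = df[col].fillna(fill_value)
    return out


# Toy stand-in for a child table with one missing numeric value.
guests = pd.DataFrame({
    "hotel_id": [1, 1, 2],
    "amenities_fee": [0.0, np.nan, 12.5],
})
filled = fill_with_indicator(guests)
print(filled)
```

Only numeric columns are handled in this sketch; text columns would need their own placeholder value.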

[error output]

Computing the sensitivity threshold...
Using parallel computation!!!
/usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:77: UserWarning: The device=cuda is not available, using device=cpu instead.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:570: UserWarning: Duplicate rate (0.0) in the data is zero. The `qt_interval` will be set to qt_interval_unique=100.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:597: UserWarning: qt_interval adjusted from 100 to 2...
  warnings.warn(
Bootstrap round: 6% 32/500 [00:04<01:17, 6.05it/s]

_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
  File "/usr/local/lib/python3.8/dist-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 589, in __call__
    return [func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 589, in <listcomp>
    return [func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/realtabformer/rtf_analyze.py", line 694, in bootstrap_inner_loop
    return SyntheticDataBench.compute_sensitivity_metric(
  File "/usr/local/lib/python3.8/dist-packages/realtabformer/rtf_analyze.py", line 552, in compute_sensitivity_metric
    test_distances: np.ndarray = distance(original, test)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/_param_validation.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/pairwise.py", line 1046, in manhattan_distances
    X, Y = check_pairwise_arrays(X, Y)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/pairwise.py", line 173, in check_pairwise_arrays
    Y = check_array(
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 957, in check_array
    _assert_all_finite(
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 122, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 171, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains NaN.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[50], line 24
     21 # Non-relational or parent table. Don't include the
     22 # unique_id field.
     23 parent_model = REaLTabFormer(model_type="tabular", epochs = 1)
---> 24 parent_model.fit(parent_df.drop(join_on, axis=1))
     26 pdir = Path("rtf_parent/")
     27 parent_model.save(pdir)

File /usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:458, in REaLTabFormer.fit(self, df, in_df, join_on, resume_from_checkpoint, device, num_bootstrap, frac, frac_max_data, qt_max, qt_max_default, qt_interval, qt_interval_unique, distance, quantile, n_critic, n_critic_stop, gen_rounds, sensitivity_max_col_nums, use_ks, full_sensitivity, sensitivity_orig_frac_multiple, orig_samples_rounds, load_from_best_mean_sensitivity, target_col)
    456     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    457 else:
--> 458     trainer = self._train_with_sensitivity(
    459         df,
    460         device,
    461         num_bootstrap=num_bootstrap,
    462         frac=frac,
    463         frac_max_data=frac_max_data,
    464         qt_max=qt_max,
    465         qt_max_default=qt_max_default,
    466         qt_interval=qt_interval,
    467         qt_interval_unique=qt_interval_unique,
    468         distance=distance,
    469         quantile=quantile,
    470         n_critic=n_critic,
    471         n_critic_stop=n_critic_stop,
    472         gen_rounds=gen_rounds,
    473         resume_from_checkpoint=resume_from_checkpoint,
    474         sensitivity_max_col_nums=sensitivity_max_col_nums,
    475         use_ks=use_ks,
    476         full_sensitivity=full_sensitivity,
    477         sensitivity_orig_frac_multiple=sensitivity_orig_frac_multiple,
    478         orig_samples_rounds=orig_samples_rounds,
    479         load_from_best_mean_sensitivity=load_from_best_mean_sensitivity,
    480     )
    482 del self.dataset
    484 elif self.model_type == ModelType.relational:

File /usr/local/lib/python3.8/dist-packages/realtabformer/realtabformer.py:607, in REaLTabFormer._train_with_sensitivity(self, df, device, num_bootstrap, frac, frac_max_data, qt_max, qt_max_default, qt_interval, qt_interval_unique, distance, quantile, n_critic, n_critic_stop, gen_rounds, sensitivity_max_col_nums, use_ks, resume_from_checkpoint, full_sensitivity, sensitivity_orig_frac_multiple, orig_samples_rounds, load_from_best_mean_sensitivity)
    600     qt_interval = _qt_interval
    602 # Computing this here before splitting may have some data
    603 # leakage issue, but it should be almost negligible. Doing
    604 # the computation of the threshold on the full data with the
    605 # train size aligned will give a more reliable estimate of
    606 # the sensitivity threshold.
--> 607 sensitivity_values = SyntheticDataBench.compute_sensitivity_threshold(
    608     train_data=df,
    609     num_bootstrap=num_bootstrap,
    610     # Divide by two so that the train_data in this computation matches the size
    611     # of the final df used to train the model. This is essential so that the
    612     # sensitivity_threshold value is consistent with the val_sensitivity.
    613     # Concretely, the computation of the distribution of min distances is
    614     # relative to the number of training observations.
    615     # The frac in this method corresponds to the size of both the test and the
    616     # synthetic samples.
    617     frac=frac / 2,
    618     qt_max=qt_max,
    619     qt_interval=qt_interval,
    620     distance=distance,
    621     return_values=True,
    622     quantile=quantile,
    623     max_col_nums=sensitivity_max_col_nums,
    624     use_ks=use_ks,
    625     full_sensitivity=full_sensitivity,
    626     sensitivity_orig_frac_multiple=sensitivity_orig_frac_multiple,
    627 )
    628 sensitivity_threshold = np.quantile(sensitivity_values, quantile)
    629 mean_sensitivity_value = np.mean(sensitivity_values)

File /usr/local/lib/python3.8/dist-packages/realtabformer/rtf_analyze.py:718, in SyntheticDataBench.compute_sensitivity_threshold(train_data, num_bootstrap, test_size, frac, qt_max, qt_interval, distance, tsvd, return_values, quantile, max_col_nums, use_ks, full_sensitivity, sensitivity_orig_frac_multiple)
    716 print("Using parallel computation!!!")
    717 with joblib.Parallel(n_jobs=n_jobs) as parallel:
--> 718     values = parallel(
    719         joblib.delayed(bootstrap_inner_loop)()
    720         for _ in tqdm(range(num_bootstrap), desc="Bootstrap round")
    721     )
    723 print("Sensitivity threshold summary:")
    724 print(pd.Series(values).describe())

File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1952, in Parallel.__call__(self, iterable)
   1946 # The first item from the output is blank, but it makes the interpreter
   1947 # progress until it enters the Try/Except block of the generator and
   1948 # reach the first yield statement. This starts the aynchronous
   1949 # dispatch of the tasks to the workers.
   1950 next(output)
-> 1952 return output if self.return_generator else list(output)

File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1595, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1592     yield
   1594     with self._backend.retrieval_context():
-> 1595         yield from self._retrieve()
   1597 except GeneratorExit:
   1598     # The generator has been garbage collected before being fully
   1599     # consumed. This aborts the remaining tasks if possible and warn
   1600     # the user if necessary.
   1601     self._exception = True

File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1699, in Parallel._retrieve(self)
   1692 while self._wait_retrieval():
   1693
   1694     # If the callback thread of a worker has signaled that its task
   1695     # triggered an exception, or if the retrieval loop has raised an
   1696     # exception (e.g. GeneratorExit), exit the loop and surface the
   1697     # worker traceback.
   1698     if self._aborting:
-> 1699         self._raise_error_fast()
   1700         break
   1702     # If the next job is not ready for retrieval yet, we just wait for
   1703     # async callbacks to progress.

File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:1734, in Parallel._raise_error_fast(self)
   1730 # If this error job exists, immediatly raise the error by
   1731 # calling get_result. This job might not exists if abort has been
   1732 # called directly or if the generator is gc'ed.
   1733 if error_job is not None:
-> 1734     error_job.get_result(self.timeout)

File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:736, in BatchCompletionCallBack.get_result(self, timeout)
    730 backend = self.parallel._backend
    732 if backend.supports_retrieve_callback:
    733     # We assume that the result has already been retrieved by the
    734     # callback thread, and is stored internally. It's just waiting to
    735     # be returned.
--> 736     return self._return_or_raise()
    738 # For other backends, the main thread needs to run the retrieval step.
    739 try:

File /usr/local/lib/python3.8/dist-packages/joblib/parallel.py:754, in BatchCompletionCallBack._return_or_raise(self)
    752 try:
    753     if self.status == TASK_ERROR:
--> 754         raise self._result
    755     return self._result
    756 finally:

ValueError: Input contains NaN.
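Since the failure is raised by scikit-learn's finiteness check inside the sensitivity computation, a quick first step is to confirm whether the frame passed to fit still contains any NaN or infinite values. The sketch below is a generic pandas check on toy data; the helper name missing_report and the toy columns are hypothetical, and it does not inspect any transformation REaLTabFormer applies internally.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain NaN/None or infinite values."""
    numeric = df.select_dtypes(include="number")
    # Infinite values only make sense for numeric columns; others count as 0.
    n_infinite = (
        np.isinf(numeric).sum().reindex(df.columns, fill_value=0).astype(int)
    )
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_infinite": n_infinite,
    })
    return report[(report["n_missing"] > 0) | (report["n_infinite"] > 0)]


# Toy stand-in for the hotels table.
hotels = pd.DataFrame({
    "city": ["Seoul", None, "Busan"],
    "rating": [4.5, np.nan, np.inf],
    "rooms": [10, 20, 30],
})
problem_columns = missing_report(hotels)
print(problem_columns)
```

If this report comes back empty for the frame passed to fit, the NaN is probably introduced later in the pipeline rather than present in the input, which would be useful information to include in the report.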

Also, what I'm really thankful for is that, although many commercial models came out after HMA, you are the only one who released yours as open source with an explanation of the algorithm. Thank you.

avsolatorio / REaLTabFormer-Experiments

Abandoned github? #1
