Closed NitinAggarwal1 closed 3 months ago
@NitinAggarwal1,
Thanks for your interest in our WSDM 2024 paper.
First, I recommend taking a closer look at our data preprocessing script [1] as well as the original NCI GitHub repo [2].
In summary, as described in the comments of the following code snippet from `pefa_xs.py`:

```python
X_trn = smat_util.load_matrix(f"{input_emb_dir}/X.trn.npy")      # trn set emb from real query text
X_tst = smat_util.load_matrix(f"{input_emb_dir}/X.tst.npy")      # tst set emb from real query text
X_abs = smat_util.load_matrix(f"{input_emb_dir}/X.trn.abs.npy")  # trn set emb from doc's abstract+title text
X_doc = smat_util.load_matrix(f"{input_emb_dir}/X.trn.doc.npy")  # trn set emb from doc's content (first 512 tokens)
X_d2q = smat_util.load_matrix(f"{input_emb_dir}/X.trn.d2q.npy")  # trn set emb from docT5query using doc's content
```
**Q1: How do we get the document embedding (derived from its title+abstract)?**

- `X_abs` is the training set embeddings derived from each document's abstract + title text.
- `Y_abs` is the diagonal doc-to-doc label matrix.
- `P_emb = Y_abs.T.dot(X_abs)`, which is essentially the document embeddings of the abstract+title text.

**Q2: How do we get the additional augmented data sources?**

Similar to the NCI paper, we use their pre-processed data augmentation:

- `X_doc`: the document embeddings from the document's full content.
- `X_d2q`: the document embeddings from pseudo queries generated by a Seq2Seq model (docT5query).

**Q3: What's the difference between `X_trn` and `X_abs`?**

- `X_trn` is the training set query embeddings (derived from query keywords).
- `X_abs` is the training set document embeddings derived from the document's abstract + title text.

I hope these FAQs answer most of your questions. If you still have other questions, feel free to ask.
Reference
I have a conceptual question regarding the PEFA paper at WSDM 2024. I was able to recreate the results using the NQ320K dataset.
“”" logging.info("Gathering data augmentation..”) P_emb = LabelEmbeddingFactory.create(Y_abs, X_abs, method="pifa", normalized_Y=False)
X_aug, Y_aug = get_data_aug(X_trn, X_doc, X_d2q, Y_trn, Y_doc, Y_d2q, aug_type="v5") logging.info("Running PEFA-XS..”) run_pefa_xs(P_emb, X_aug, Y_aug, X_tst, Y_tst, lambda_erm=lambda_erm) “”” In the above part of code in pefa_xs.py :
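For context, one plausible reading of the `get_data_aug` step is that it simply stacks the three instance/label sources row-wise so that real queries, full-content document embeddings, and docT5query pseudo queries all contribute (instance, label) pairs. This is a hypothetical sketch; the actual helper and its `aug_type` variants live in `pefa_xs.py` and may combine the sources differently:

```python
import numpy as np
import scipy.sparse as smat

def get_data_aug_sketch(X_trn, X_doc, X_d2q, Y_trn, Y_doc, Y_d2q):
    # Stack instances row-wise; the label matrices are stacked the same
    # way so row i of X_aug still aligns with row i of Y_aug.
    X_aug = np.vstack([X_trn, X_doc, X_d2q])
    Y_aug = smat.vstack([Y_trn, Y_doc, Y_d2q], format="csr")
    return X_aug, Y_aug

# Toy example: 2 real queries, 2 doc-content rows, 2 pseudo-query rows,
# all over the same 2 documents, embedding dim 3.
X_trn = np.random.rand(2, 3)
X_doc = np.random.rand(2, 3)
X_d2q = np.random.rand(2, 3)
Y = smat.csr_matrix(np.eye(2))
X_aug, Y_aug = get_data_aug_sketch(X_trn, X_doc, X_d2q, Y, Y, Y)
print(X_aug.shape, Y_aug.shape)  # (6, 3) (6, 2)
```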
Am I missing something about the Natural Questions data here?
In a general sense, for my custom data I would have queries and some labels in my corpus:
Query-label pairs for train (these would create X_trn and Y_trn, and P_emb would be created using the PIFA method).
Query-label pairs for test (these would create X_tst and Y_tst, the test set on which we run).
My question is: what data does PEFA use to get the second component, pifa_emb? (For Natural Questions we use X_aug and Y_aug to generate it, which are nothing but the embeddings of the queries and the sparse label matrix for those queries, respectively.)
What is the difference between X_abs and X_trn?
@OctoberChang