ashishakkumar closed 5 months ago
Hi! If your goal is to encode an entire piece of text, there's no need to specify --fields under input or encoder. Simply aggregating everything into the contents and stripping away any delimiters through your preprocessing scripts should suffice.
The encoder parameter --fields <fields supported by your encoder> tells the encoder which inputs to consider. For instance, tct-colbert can process texts (the default argument for Pyserini text encoders) as well as titles. Different encoders may accept different input fields, provided they have been trained on such data. For example, unicoil can also process expands.
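A minimal sketch of the preprocessing step described above, assuming each line of the corpus is valid JSON with id and contents keys and that "@&" is the delimiter to strip. The helper names flatten_contents and rewrite_jsonl are made up for illustration, and dropping the literal "None" placeholders is an assumption about this particular dataset rather than anything Pyserini requires.

```python
import json

DELIMITER = "@&"  # delimiter used inside `contents` in this thread's data


def flatten_contents(record, delimiter=DELIMITER):
    """Return a copy of the record whose contents is one delimiter-free string."""
    parts = [part.strip() for part in record["contents"].split(delimiter)]
    # Assumption: empty parts and literal "None" placeholders carry no information.
    kept = [part for part in parts if part and part != "None"]
    return {"id": record["id"], "contents": " ".join(kept)}


def rewrite_jsonl(src_path, dst_path):
    """Read a JSONL corpus and write a flattened copy, one JSON object per line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                record = flatten_contents(json.loads(line))
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The rewritten file can then be passed as the corpus without any --fields argument, as suggested above.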
Got it. Thanks.
I have a JSONL file containing dictionaries with entries like this:
{'id': 'NCT01740609', 'contents': "A Study To Assess The Safety Of PF-06342674 In Healthy Volunteers@&The purpose of this study is to evaluate the safety, tolerability, pharmacokinetics and immunogenicity of single escalating doses PF-06342674.@&None@&COMPLETED@&['Healthy']@&ALL@&False@&18 Years@&None@&['Phase 1', 'RN168', 'Healthy Volunteers']@&None@&None@&None@&Inclusion Criteria:\n\n Male subjects and female of non-childbearing potential subjects between the ages of 18 and 55.\n BMI between 18.5 to 32 kg/m2.\n Total body weight ≥40 kg and ≤120 kg.\n\nExclusion Criteria:\n\n Previous treatment with an antibody within 6 months prior to Day 1.\n Pregnant or nursing females; females of childbearing potential.\n History of sensitivity to heparin or heparin-induced thrombocytopenia.@&ALL@&None@&None@&None@&None@&2014-06@&COMPLETED"}
I am trying to encode the documents (JSONL) using the dense encoder:
python -m pyserini.encode input --corpus transformed_data.jsonl --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' --delimiter "@&" --shard-id 0 --shard-num 1 output --embeddings pyserini_embeddings --to-faiss encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' --batch 32 --device cpu
The error after running the above command is:
I inspected main.py in pyserini/encode; the parser for the --fields argument is:
input_parser.add_argument('--fields', help='fields that contents in jsonl has (in order)', nargs='+', default=['text'], required=False)
From this, it seems that the collection iterator always expects a "text" field and can additionally store "title" and "expands" fields. Is it possible to extend it to any number of desired fields? Thanks!
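As a side note on the add_argument line quoted above: nargs='+' collects whitespace-separated values, so quoted names followed by commas keep the comma as part of the value. The sketch below reproduces only that one argument in a standalone parser (not Pyserini's actual parser) and uses shlex.split to approximate how a POSIX shell would tokenize the command. The helper parse_fields is hypothetical. Since the error output is not shown here, this may or may not be related to it.

```python
import argparse
import shlex

# Standalone parser mirroring only the quoted --fields definition.
parser = argparse.ArgumentParser()
parser.add_argument('--fields', nargs='+', default=['text'], required=False)


def parse_fields(command_tail):
    """Tokenize roughly like a POSIX shell, then return the parsed field list."""
    return parser.parse_args(shlex.split(command_tail)).fields


with_commas = parse_fields("--fields 'brief_title', 'brief_summary', 'Status'")
with_spaces = parse_fields("--fields brief_title brief_summary Status")

print(with_commas)  # ['brief_title,', 'brief_summary,', 'Status']
print(with_spaces)  # ['brief_title', 'brief_summary', 'Status']
```

Separating the names with spaces only gives the parser the field names without trailing commas.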