:toc: auto
This project creates machine learning datasets from US Congress bills. It currently supports creation of the following datasets:

* `bill_summary_us`
* `bill_text_us`
To build the Docker image required for the project, you can use a manual `docker` command or the following command:

[source,bash]
----
export DOCKER_IMAGE_TAGS=billml:1.0.0; . ./build_docker_image.sh
----
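If you prefer to build manually, a minimal sketch of the equivalent `docker` command is below; the build context at the repo root and the absence of extra build arguments are assumptions, since the actual flags live in `build_docker_image.sh`:

[source,bash]
----
# Hypothetical manual build; build_docker_image.sh is the supported path
# and may pass additional flags or tags.
docker build -t billml:1.0.0 .
----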
This chapter contains a table explaining the environment variables that are used in the project's docker container.
|===
|Variable name |Default value |Description

|BILLML_IMAGE_NAME
|dreamproit/billml
|The name of the `billml` image that will be used in `docker-compose.yaml`.

|BILLML_IMAGE_VERSION
|1.0.0
|The version of the `billml` image that will be used in `docker-compose.yaml`.

|BILLML_CONTAINER_NAME
|billml
|The name of the `billml` service container in docker-compose.

|BILLML_PLATFORM_NAME
|linux/x86_64
|The name of the platform that will be used by the `billml` docker compose service.

|BILLML_RUN_FOR_EVER
|False
|Used in `billml`'s `entrypoint.sh` to keep the container running indefinitely (very handy in development).

|BILLML_LOCAL_VOLUME_PATH
|/path/to/repo/root
|Used in the dev docker-compose to mount a local volume to the container root.

|BILLML_LOG_OUTPUT_FOLDER
|/usr/src/logs
|Path to the folder where the project will save `.log` files and `.gz` log archives.

|BILLML_LOCAL_FILESYSTEM_CONCURRENCY_LIMIT
|4000
|The number of concurrent tasks used for dataset creation. (Lower this number if you encounter an `OSError`.)

|BILLML_DATASETS_STORAGE_FILEPATH
|/datasets_data
|Path to the folder where the project will save datasets.

|CONGRESS_DATA_FOLDER_FILEPATH
|/bills_data/data
|Path to the folder where the https://github.com/dreamproit/congress[congress] project saves bills data.

|CONGRESS_CACHE_FOLDER_FILEPATH
|/bills_data/cache
|Path to the folder where the https://github.com/dreamproit/congress[congress] project saves cache data.

|BILLML_BACKUP_BILLS_DATA_CLOUD_PROVIDER
|s3
|The name of the cloud provider used to store bills data.

|BILLML_BACKUP_BILLS_DATA_BUCKET_NAME
|your-bucket-name
|The name of the AWS S3 bucket for bills data.

|BILLML_BACKUP_BILLS_DATA_ACCESS_KEY
|your-aws-access-key
|The AWS access key for your S3 bucket.

|BILLML_BACKUP_BILLS_DATA_SECRET_KEY
|your-aws-secret-key
|The AWS secret key for your S3 bucket.

|BILLML_BACKUP_DATASET_DATA_CLOUD_PROVIDER
|s3
|The name of the cloud provider used to store datasets data.

|BILLML_BACKUP_DATASET_DATA_BUCKET_NAME
|your-bucket-name
|The name of the AWS S3 bucket for datasets data.

|BILLML_BACKUP_DATASET_DATA_ACCESS_KEY
|your-aws-access-key
|The AWS access key for your S3 bucket.

|BILLML_BACKUP_DATASET_DATA_SECRET_KEY
|your-aws-secret-key
|The AWS secret key for your S3 bucket.

|BILLML_BACKUP_DATASET_HF_SECRET_KEY
|your-hugging-face-token
|The Hugging Face https://huggingface.co/docs/hub/security-tokens[token] required for the upload-datasets-to-Hugging-Face utility script to work.
|===
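For reference, a minimal `.env` file populated with the non-credential defaults from the table above might look like the sketch below; it assumes the file is plain `KEY=value` pairs. The backup and Hugging Face credential variables follow the same pattern and are covered in the backup chapters later in this readme.

[source,bash]
----
# Sketch of a .env file using the defaults from the table above.
BILLML_IMAGE_NAME=dreamproit/billml
BILLML_IMAGE_VERSION=1.0.0
BILLML_CONTAINER_NAME=billml
BILLML_PLATFORM_NAME=linux/x86_64
BILLML_RUN_FOR_EVER=False
BILLML_LOCAL_VOLUME_PATH=/path/to/repo/root
BILLML_LOG_OUTPUT_FOLDER=/usr/src/logs
BILLML_LOCAL_FILESYSTEM_CONCURRENCY_LIMIT=4000
BILLML_DATASETS_STORAGE_FILEPATH=/datasets_data
CONGRESS_DATA_FOLDER_FILEPATH=/bills_data/data
CONGRESS_CACHE_FOLDER_FILEPATH=/bills_data/cache
----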
Use the make command to create the `.env` file:

[source,bash]
----
make env_setup
----
To download BILLSTATUS data (`fdsys_billstatus.xml` files), use the following CLI command:

[source,bash]
----
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS"
----
NOTE: You can add the `--congress` parameter to download bills for other congress(es). The data will be stored in the `CONGRESS_DATA_FOLDER_FILEPATH` folder.
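For example, a run scoped to a single congress might look like the sketch below; the `--congress=117` value is illustrative, so check the congress project's CLI help for the exact syntax:

[source,bash]
----
# Download BILLSTATUS data only for the 117th Congress (illustrative flag value).
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS --congress=117"
----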
To parse `fdsys_billstatus.xml` files to `data.json` and `data.xml` files, use the following CLI command:

[source,bash]
----
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS"
----
NOTE: You can add the `--congress` parameter to process bills for other congress(es).
To download bills `text-versions`, use the following CLI command:

[source,bash]
----
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf"
----
NOTE: You can add the `--congress` parameter to download bills for other congress(es), as well as the `--extract` parameter to extract other formats.
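As an illustration, extracting only XML for a single congress might look like the sketch below; the flag values are illustrative subsets of the command above:

[source,bash]
----
# Download text-versions for the 117th Congress, extracting XML only (illustrative values).
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml --congress=117"
----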
To check local bills data files, use the following CLI command:

[source,bash]
----
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/tasks/check_local_files.py"
----

This script provides information about the local filesystem. If you run it without skipping any steps, it will also download all missing BILLSTATUS and bills data (very handy for updating local bills data).
To create "bill_summary_us" dataset from scratch you need to do following steps with help of https://github.com/dreamproit/congress[congress] project:
fdsys_billstatus.xml
files)fdsys_billstatus.xml
files to data.json
and data.xml
filestext-versions
The steps above described in the "Project setup" chapter of this readme.
Use the following command to create the "bill_summary_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.

[source,bash]
----
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'"
----
|===
|Parameter name |Default value |Description

|--dataset_names
|None
|The name of the dataset the user wants to create.

|--sections_limit
|None
|The minimum number of sections a bill must have to be included in the dataset. All bills with a number of sections greater than or equal to `sections_limit` will be included.

|--congresses_to_include
|None
|The congress numbers the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem will be included.

|--bill_types_to_include
|['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres']
|The bill types the user wants to include in the dataset(s). If no value is provided, all bill types will be included.
|===
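For example, a more selective run might combine these parameters as in the sketch below; the value formats (comma-separated congress numbers) are assumptions, so check `main.py --help` for the exact syntax:

[source,bash]
----
# Hypothetical invocation: bills with at least 3 sections from two recent congresses.
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us' --sections_limit=3 --congresses_to_include='117,118'"
----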
To create "bill_text_us" dataset from scratch you need to do following steps with help of https://github.com/dreamproit/congress[congress] project:
text-versions
The steps above described in the "Project setup" chapter of this readme.
Use the following command to create the "bill_text_us" dataset. Dataset creation relies heavily on the presence of bills data, so if you are setting up the project for the first time, make sure you completed the previous steps correctly and the bills data is present.

[source,bash]
----
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'"
----
|===
|Parameter name |Default value |Description

|--dataset_names
|None
|The name of the dataset the user wants to create.

|--sections_limit
|None
|The minimum number of sections a bill must have to be included in the dataset. All bills with a number of sections greater than or equal to `sections_limit` will be included.

|--congresses_to_include
|None
|The congress numbers the user wants to include in the dataset. If no value is provided, all congresses available in the filesystem will be included.

|--bill_types_to_include
|['hconres', 'hjres', 'hr', 'hres', 's', 'sconres', 'sjres', 'sres']
|The bill types the user wants to include in the dataset(s). If no value is provided, all bill types will be included.
|===
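Similarly, you can restrict the bill types included; the comma-separated value format in the sketch below is an assumption, so verify it against `main.py --help`:

[source,bash]
----
# Hypothetical invocation: limit the dataset to House and Senate bills only.
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/main.py --dataset_names='bill_text_us' --bill_types_to_include='hr,s'"
----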
You can set up credentials to upload locally downloaded bills data to cloud storage (currently AWS S3 is supported). To do that, use the following command:

[source,bash]
----
docker exec billml_compose-billml bash -c ". utils/upload_bills_data.sh"
----
You can set up credentials to download bills data (previously uploaded via the project) from cloud storage (currently AWS S3 is supported). To do that, use the following command:

[source,bash]
----
docker exec billml_compose-billml bash -c ". utils/download_bills_data.sh"
----
NOTE: You must set up the ENV variables with your S3 bucket name and keys.
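The variables in question are the `BILLML_BACKUP_BILLS_DATA_*` entries from the environment variables table; a sketch of setting them in your `.env` file, with placeholder values:

[source,bash]
----
# Placeholder credentials for the bills-data backup scripts; replace with your own.
BILLML_BACKUP_BILLS_DATA_CLOUD_PROVIDER=s3
BILLML_BACKUP_BILLS_DATA_BUCKET_NAME=your-bucket-name
BILLML_BACKUP_BILLS_DATA_ACCESS_KEY=your-aws-access-key
BILLML_BACKUP_BILLS_DATA_SECRET_KEY=your-aws-secret-key
----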
You can set up credentials to upload locally created datasets data to cloud storage (currently AWS S3 is supported). To do that, use the following command:

[source,bash]
----
docker exec billml_compose-billml bash -c ". utils/upload_datasets_data.sh"
----
This chapter explains how to set up scheduled creation and uploading of fresh datasets to a Hugging Face repo. The following steps are described as separate chapters in this readme.

Set up crontab-scheduled steps to create datasets.
After you set up the ENV variables in the `.env` file, you can use the `crontab -e` command to edit the crontab file and add the backup script. Example of a recommended schedule (at 23:11 on Sunday):
[source,bash]
----
11 23 * * SUN docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/congress/core/run.py govinfo --bulkdata=BILLSTATUS; \
poetry run python /usr/src/app/congress/core/run.py bills --collections=BILLS; \
poetry run python /usr/src/app/congress/core/run.py govinfo --collections=BILLS --extract=xml,pdf; \
poetry run python /usr/src/app/congress/core/tasks/check_local_files.py; \
poetry run python /usr/src/app/main.py --dataset_names='bill_summary_us'; \
poetry run python /usr/src/app/main.py --dataset_names='bill_text_us'; \
. utils/upload_bills_data.sh; \
. utils/upload_datasets_data.sh
"
----
After the scheduled run, you can use the following commands to upload datasets to Hugging Face:

[source,bash]
----
# find the name of the created 'bill_summary_us' dataset
ls /datasets_data/bill_summary_us
# use the utility script to upload the dataset, for example:
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/utils/upload_dataset_to_hf.py --source_dataset_filepath=/datasets_data/bill_summary_us/bill_summary_us_11-10-2023_13-02-35.jsonl --hf_dataset_name=dreamproit/bill_summary_us --hf_path_in_repo=bill_summary_us.jsonl"
----
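Since dataset filenames carry a timestamp, a small sketch like the one below can pick the newest file automatically; it assumes modification-time ordering via `ls -t` is acceptable and that the datasets folder is visible on the host at the same path:

[source,bash]
----
# Grab the most recently modified bill_summary_us dataset file and upload it (sketch).
LATEST=$(ls -t /datasets_data/bill_summary_us | head -n 1)
docker exec billml_compose-billml bash -c "poetry run python /usr/src/app/utils/upload_dataset_to_hf.py --source_dataset_filepath=/datasets_data/bill_summary_us/${LATEST} --hf_dataset_name=dreamproit/bill_summary_us --hf_path_in_repo=bill_summary_us.jsonl"
----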
Example `.jsonl` dataset files are stored in the repo's `samples` folder. Each example dataset file has 10 items.