ebi-ait / checklist

Template repository for checklists
Apache License 2.0

run all scripts and perform comparison #25

Closed amnonkhen closed 3 weeks ago

amnonkhen commented 1 month ago

setup python

cd /nfs/production/tburdett/workstreams/fairification/checklists
# create the sw and data directories (square brackets are not expanded by the shell)
mkdir -p sw data
export PYTHONHOME=/hps/software/jupyterhub
export PATH=$PATH:$PYTHONHOME/bin
python --version
# create the virtual environment under sw/ and activate it from the checklists directory
python -m venv sw/.venv
. sw/.venv/bin/activate
# get webin username and password into ENA_USER and ENA_PASSWORD variables in .env file
. .env
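
Later steps pass $ENA_USER and $ENA_PASSWORD on the command line, so it may be worth failing early if .env did not set them. A minimal sketch, assuming a POSIX shell; require_env is a hypothetical helper and not part of checklist-converter:

```shell
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_env() {
    for name in "$@"; do
        # Indirect lookup: read the value of the variable whose name is in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            return 1
        fi
    done
}
```

For example, `require_env ENA_USER ENA_PASSWORD || echo "check .env"` could be run right after sourcing the file.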

clone scripts

cd /nfs/production/tburdett/workstreams/fairification/checklists
git clone https://github.com/ebi-ait/checklist-converter.git

get xmls

If running for the first time, run the script to get the XMLs from ENA; otherwise, copy them from the previous run directory. This step can be skipped if nothing has changed in the ENA documents or the validation since the previous run.

cd checklist-converter
cut -f2 -d, data/accessions.csv | tail -n+2 | xargs -n1 -t python src/retrieve_xml_from_bsd_accession.py -o /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xmls/ -a
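
The pipeline above takes the second comma-separated column of data/accessions.csv and drops the header row, so each accession becomes one script invocation. To preview that list before fetching anything, the first two stages can be wrapped in a small function. This is only a sketch; list_accessions is a hypothetical name, and the column layout (checklist ID first, accession second) is inferred from the commands in this issue:

```shell
# Print the accessions (column 2) from the CSV given as $1, skipping the header row.
list_accessions() {
    cut -f2 -d, "$1" | tail -n +2
}
```

`list_accessions data/accessions.csv | head` shows the first few accessions that would be retrieved.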

get jsons

If running for the first time, run the script to get the JSONs from ENA; otherwise, copy them from the previous run directory.

cut -f2 -d, data/accessions.csv | tail -n +2 | xargs -t -I{} python src/ena_sample_json_retriever.py --out_file /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/jsons/{} --accession {} --user "$ENA_USER" --password "$ENA_PASSWORD"
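
For later runs, "copy from the previous run directory" could look like the sketch below. seed_run is a hypothetical helper; it assumes each run keeps its inputs in xmls/ and jsons/ subdirectories, as the run001 paths above do:

```shell
# Hypothetical helper: seed a new run with the XMLs and JSONs of an earlier run.
# Usage: seed_run DATA_DIR PREVIOUS_RUN NEW_RUN
seed_run() {
    data_dir=$1
    prev=$2
    new=$3
    for kind in xmls jsons; do
        # Skip kinds the previous run did not produce
        if [ -d "$data_dir/$prev/$kind" ]; then
            mkdir -p "$data_dir/$new/$kind"
            # Copy the directory contents (trailing /.) rather than nesting the directory itself
            cp -R "$data_dir/$prev/$kind/." "$data_dir/$new/$kind/"
        fi
    done
}
```

For example: `seed_run /nfs/production/tburdett/workstreams/fairification/checklists/data run001 run002`.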

validate using ena xml validation

python src/validate_xml_against_ena_dev.py --input /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xmls/ --out_dir /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xml_validation/ --user "$ENA_USER" --password "$ENA_PASSWORD"

validate using biovalidator

Use src/validate_biovalidator.py.

In one window (on the local computer):

cd checklist-converter
docker-compose up

Get your external IP address (usually 10.x.y.z) using ifconfig.

In another window (on the codon cluster):

export BIOVALIDATOR_SERVICE=address-of-biovalidator
tail -n +2 data/accessions.csv | awk -F, '{print "--schema schema/"$1"-ENA.json" " --data /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/jsons/"$2}' | xargs -L 1 -t python src/validate_json_using_biovalidator.py --output_dir /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/json_validation
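
The awk stage turns each CSV row into one line of --schema/--data arguments, which xargs -L 1 then appends to the validator command. To inspect those lines before running the validation, the same transformation can be isolated in a function. This is a sketch only; biovalidator_args is a hypothetical name, and the JSON directory is passed as a parameter instead of being hard-coded:

```shell
# Print one line of --schema/--data arguments per CSV row (header skipped).
# $1: accessions CSV (column 1 = checklist ID, column 2 = accession)
# $2: directory holding the retrieved JSON files
biovalidator_args() {
    tail -n +2 "$1" | awk -F, -v dir="$2" \
        '{print "--schema schema/" $1 "-ENA.json --data " dir "/" $2}'
}
```

Piping its output into the xargs command above should behave like the original one-liner, while calling it alone shows which schema is paired with which data file.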

compare results

run 001

check results of ena xml validation

Skip this step if this run involves only changes in json validation.

echo submission server errors
find /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xml_validation/500 -type f | wc -l
echo submission document errors
find /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xml_validation/200 -type f | xargs -n1 grep "The object being added already exists" | wc -l
echo validation errors
find /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xml_validation/200 -type f | xargs grep "Error validating" | wc -l
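
The three counts can also be gathered by one function that takes the xml_validation directory as a parameter, which avoids repeating the long path for every run. A sketch under assumptions: xml_validation_summary is a hypothetical helper, it relies on the recursive grep -r supported by GNU and BSD grep, and it counts matching files (grep -l), whereas the commands above count matching lines:

```shell
# Hypothetical helper: print the three counts for one run's xml_validation directory.
xml_validation_summary() {
    dir=$1
    # Responses stored under 500/ are submission server errors
    server=$(find "$dir/500" -type f 2>/dev/null | wc -l)
    # grep -r -l lists each matching file once, so these are file counts
    exists=$(grep -r -l "The object being added already exists" "$dir/200" 2>/dev/null | wc -l)
    invalid=$(grep -r -l "Error validating" "$dir/200" 2>/dev/null | wc -l)
    # $((...)) strips any padding that wc adds on some platforms
    echo "server_errors=$((server)) already_exists=$((exists)) validation_errors=$((invalid))"
}
```

For example: `xml_validation_summary /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xml_validation`.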

Comparison:

find /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xml_validation/200 -type f | xargs -n1 grep -l "The object being added already exists" | sort  | awk -F/ '{print $NF}' | cut -d. -f1 > /nfs/production/tburdett/workstreams/fairification/checklists/data/run001/xml_valid.txt
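
Once two such lists exist, for instance xml_valid.txt from two different runs, they can be compared with comm. This is one possible approach and not necessarily the comparison used for this issue; compare_valid_lists is a hypothetical helper that sorts both inputs first, because comm requires sorted files:

```shell
# Hypothetical helper: show accessions that appear in only one of two lists.
# Lines unique to the first file are printed as-is;
# lines unique to the second file are prefixed with a tab.
compare_valid_lists() {
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    # -3 suppresses the column of lines common to both files
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

An empty result means both lists contain the same accessions.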

json validation

find /nfs/production/tburdett/workstreams/fairification/checklists/data/run004/json_validation -type f | xargs -n1 python src/check_json_validation_results.py > /nfs/production/tburdett/workstreams/fairification/checklists/data/run004/json_validation_results.csv

amnonkhen commented 1 month ago

runs 001 and 002 discovered 2 problems: