AgResearch / gbs_prism

refactored GBS processing

port keyfile sanitising to gquery #45

Open afmcc opened 8 months ago

afmcc commented 8 months ago

Formatting problems with externally supplied ("walkin") keyfiles, such as ragged ends and non-ASCII characters, were previously handled by a script in this repo, which was called by a bash script. That bash script has been deprecated, as the keyfile import is now handled by gquery/gupdate. So we need to port the sanitising code from the script in this repo into the relevant gupdate module, around line 1797, i.e. this code:

    with open(get_predicate("walkins_file"),"r") as walkins:
        walkin_columns = None
        walkin_records = []
        for rec in walkins:
            # first line is the header: record the lower-cased column names
            if walkin_columns is None:
                walkin_columns = [ item.lower() for item in re.split("\t",rec.strip()) ]
                continue
            # every later line must have the same number of fields as the header
            fields = re.split("\t",rec)
            if len(fields) != len(walkin_columns):
                raise illumina_sequencing_exception("number of fields in in header of walkins file %s (%d) is not always the same as in the rest of the file (%d)"%
                                                    ( get_predicate("walkins_file"), len(walkin_columns),len(fields)))
            walkin_records.append(dict(zip(walkin_columns, fields)))
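
For reference, a minimal sketch of the kind of sanitising that would need to run before the field-count check above. This is not the code from the deprecated script: the function names (`sanitise_keyfile_line`, `sanitise_keyfile`) and the exact rules (fold non-ASCII to ASCII, trim empty trailing fields, pad short rows to the header width) are assumptions for illustration only.

```python
import unicodedata


def sanitise_keyfile_line(line, n_columns=None):
    """Return the cleaned tab-delimited fields of one keyfile line.

    Hypothetical helper, not part of gquery/gupdate.
    """
    # Fold accented characters to their ASCII base form and drop anything
    # that still cannot be represented in ASCII.
    text = unicodedata.normalize("NFKD", line).encode("ascii", "ignore").decode("ascii")
    fields = [field.strip() for field in text.rstrip("\r\n").split("\t")]
    if n_columns is not None:
        # Ragged ends: drop empty trailing fields beyond the header width.
        while len(fields) > n_columns and fields[-1] == "":
            fields.pop()
        # Short rows: pad with empty fields up to the header width.
        fields.extend([""] * (n_columns - len(fields)))
    return fields


def sanitise_keyfile(lines):
    """Return a list of rows (header first), all cleaned to the header width."""
    rows = []
    n_columns = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if n_columns is None:
            header = sanitise_keyfile_line(line)
            # The header itself may carry empty trailing columns.
            while header and header[-1] == "":
                header.pop()
            n_columns = len(header)
            rows.append(header)
        else:
            rows.append(sanitise_keyfile_line(line, n_columns))
    return rows
```

Rows that still have extra non-empty fields after this are left longer than the header, so the existing field-count check would still reject them rather than silently dropping data.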
afmcc commented 8 months ago

The patch is so that we don't see this kind of thing:

running gupdate --explain -t create_gbs_keyfiles -p "fastq_folder_root=/dataset/2023_illumina_sequencing_c/scratch/postprocessing/illumina/novaseq;run_folder_root=/dataset/2023_illumina_sequencing_c/active;out_folder=/dataset/hiseq/active/key-files;sample_sheet=/dataset/2023_illumina_sequencing_c/active/240109_A01439_0232_AHNGHFDRX3/SampleSheet.csv;import" all


    * oops something went wrong :(
    * The original exeption encountered is below. To help debug the
    * problem, a log of this session is here :
    * /dataset/genophyle_data/scratch/gupdate/all-job.7.log


    Traceback (most recent call last):
      File "/dataset/gseq_processing/active/bin/gquery/", line 378, in <module>
        sys.exit(main())
      File "/dataset/gseq_processing/active/bin/gquery/", line 324, in main
        illumina.illumina(s).create_gbs_keyfiles()
      File "/bifo/active/gseq_processing/bin/gquery/sequencing/", line 127, in create_gbs_keyfiles
        platform.create_gbs_keyfiles()
      File "/bifo/active/gseq_processing/bin/gquery/sequencing/", line 1685, in create_gbs_keyfiles
        columns=self.create_gbs_keyfile(parameters_dict, key_path, append = False)
      File "/bifo/active/gseq_processing/bin/gquery/sequencing/", line 1726, in create_gbs_keyfile
        walkin_columns = self.create_or_append_external_gbs_keyfile(predicates, key_path, append_existing)
      File "/bifo/active/gseq_processing/bin/gquery/sequencing/", line 1771, in create_or_append_external_gbs_keyfile
        ( get_predicate("walkins_file"), len(walkin_columns),len(fields)))
    sequencing.illumina.illumina_sequencing_exception: number of fields in in header of walkins file /dataset/hiseq/active/key-files/SQ3047.txt (15) is not always the same as in the rest of the file (27)

sorry - quitting after received bad return code from database import -try looking at the log file shown above
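
To see which rows of a keyfile trigger the mismatch reported above (15 header fields versus 27 in a later row), something like the following could be run against the file. This is only a sketch: `field_count_report` is a hypothetical helper, not part of gquery, and it counts raw tab-delimited fields without any cleaning.

```python
def field_count_report(lines):
    """Return (header_count, mismatches) for a tab-delimited file.

    mismatches maps 1-based line numbers to the field count of every line
    whose count differs from the header's. Hypothetical helper.
    """
    header_count = None
    mismatches = {}
    for line_number, line in enumerate(lines, start=1):
        # Count raw fields, keeping empty trailing ones so ragged ends show up.
        count = len(line.rstrip("\r\n").split("\t"))
        if header_count is None:
            header_count = count
        elif count != header_count:
            mismatches[line_number] = count
    return header_count, mismatches
```

Passing an open file handle works, since iterating over it yields lines, e.g. `field_count_report(open(path))` with `path` set to the keyfile named in the error.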