Some fastq files have the wrong flow cell as tag or in their file path

diitaz93 commented 1 year ago

Description

After flow cell ids were added as tags to fastq and spring files, some fastq files ended up with a tag that does not match the flow cell in their file name. The cause is not certain, but most likely the flow cell fetched from status db for the tag assignment was incorrect.

Solution

I have identified all fastq files where this happens and compiled a csv file called 01_map_file2realFCID.csv with 3 columns: the first is the full path of the file, the second is the flow cell tag, and the third is the true flow cell, extracted directly from the fastq file using zless. It looks like this:

/home/proj/production/housekeeper-bundles/<bundle>/2018-02-15/<some_file_name_1>.fastq.gz,HF57YADXX^M,HF42YADXX
/home/proj/production/housekeeper-bundles/<bundle>/2018-02-15/<some_file_name_2>.fastq.gz,HF57YADXX^M,HF42YADXX
/home/proj/production/housekeeper-bundles/<bundle>/2018-02-15/<some_file_name_3>.fastq.gz,HF57YADXX^M,HF42YADXX
...

I could not get rid of the Mac newline remnant (^M) in the csv file, so the script below strips it from the tag.
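
For reference, the true flow cell in the third column comes from the first header line of each fastq file. A minimal sketch of that lookup, assuming the files use the Illumina header layout `@<instrument>:<run>:<flow cell>:<lane>:<tile>:<x>:<y>` (older header formats do not carry the flow cell); both function names are hypothetical and not part of cg or Housekeeper:

```python
import gzip


def flow_cell_from_header(header: str) -> str:
    """Return the flow cell id from an Illumina FASTQ header line.

    Assumes the layout @<instrument>:<run>:<flow cell>:<lane>:<tile>:<x>:<y>.
    """
    fields = header.strip().lstrip("@").split(":")
    if len(fields) < 7:
        raise ValueError(f"Unexpected FASTQ header layout: {header!r}")
    return fields[2]


def flow_cell_from_fastq(path: str) -> str:
    """Read only the first line of a gzipped FASTQ file and parse it."""
    with gzip.open(path, "rt") as handle:
        return flow_cell_from_header(handle.readline())
```

Only the first line is read, so this stays cheap even for large files.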

The following script reads the csv file and updates the file paths and tags for every Housekeeper File in the list:

import csv
import os
import sys

from cg.apps.housekeeper.hk import HousekeeperAPI
from housekeeper.store.models import File, Tag
from utils.hk import get_hk_api

class FastqFileUpdater:
    """Class that encompasses the updating of tags and paths of a list of fastq files."""

    def __init__(self, input_file: str, hk_api: HousekeeperAPI, env: str, dry_run: bool):
        self.hk_api: HousekeeperAPI = hk_api
        self.env: str = env
        self.dry_run: bool = dry_run
        self.input_file: str = input_file
        self.original_file_path: str | None = None
        self.new_file_path: str | None = None
        self.flow_cell_tag: str | None = None
        self.real_flow_cell_id: str | None = None
        self.file: File | None = None

    @staticmethod
    def strip_flow_cell_tag(flow_cell_tag: str):
        """Return the flow cell tag without special characters for new line."""
        flow_cell_tag = flow_cell_tag.strip()
        if flow_cell_tag.endswith('^M'):
            return flow_cell_tag[:-2]
        return flow_cell_tag

    def initialise_file_attributes(self, row: list[str]) -> None:
        """Assigns values to the class attributes file_path, flow_cell_tag and real_flow_cell_id."""
        self.original_file_path, flow_cell_tag, self.real_flow_cell_id = row
        self.flow_cell_tag: str = self.strip_flow_cell_tag(flow_cell_tag=flow_cell_tag)
        self.file: File = self.hk_api.files(path=self.original_file_path).first()
        print(f"[READ] Class attributes updated for file {self.original_file_path}")

    def get_new_file_path(self) -> None:
        """Updates the new_file_path with the modified original file path."""
        path_parts: list[str] = self.original_file_path.strip().split("/")
        # A leading "/" yields an empty first element: ["", "home", "proj", "<env>", ...]
        path_parts[3] = self.env
        file_name = path_parts[-1]
        new_file_name: str = f"{self.real_flow_cell_id}_{file_name}"
        new_parts: list[str] = path_parts[:-1] + [new_file_name]
        self.new_file_path = "/".join(new_parts)
        print(f"[PATH] New path generated for file {self.original_file_path}")

    def update_path_hasta(self):
        """Rename the file path in Hasta."""
        if self.dry_run:
            print("[PATH HASTA - dry run] Would have renamed "
                  f"{self.original_file_path} to {self.new_file_path}")
            return
        os.rename(self.original_file_path, self.new_file_path)
        print(f"[PATH HASTA] Renamed {self.original_file_path} to {self.new_file_path} in Hasta")

    def update_entry_hk(self) -> None:
        """Update the file path in Housekeeper."""
        if self.dry_run:
            print("[PATH HK - dry run] Would have renamed path in Housekeeper for file "
                  f"{self.original_file_path} to {self.new_file_path}")
            return
        self.file.path = self.new_file_path
        print("[PATH HK] Renamed path in Housekeeper for file "
              f"{self.original_file_path} to {self.new_file_path}")

    def update_paths(self):
        """Update the paths of current file in Hasta and Housekeeper if applicable."""
        if self.real_flow_cell_id not in self.original_file_path:
            self.get_new_file_path()
            self.update_path_hasta()
            self.update_entry_hk()
        else:
            print(f"[PATHS] No path modification needed for file {self.original_file_path}")

    def update_file_tag(self) -> None:
        """Replaces the old tag by the new tag."""
        new_tag: Tag = self.hk_api.get_tag(name=self.real_flow_cell_id)
        old_tag: Tag = self.hk_api.get_tag(name=self.flow_cell_tag)
        # Work on a copy so that a dry run does not mutate the tags of the file.
        new_file_tags: list[Tag] = list(self.file.tags)
        new_file_tags.remove(old_tag)
        new_file_tags.append(new_tag)
        if self.dry_run:
            print(f"[TAGS - dry-run] Would have updated tags for {self.original_file_path}")
            return
        self.file.tags = new_file_tags
        print(f"[TAGS] Successfully updated the tags for file {self.original_file_path}")

    def update_fastq_files(self):
        with open(self.input_file, "r") as csvfile:
            reader = csv.reader(csvfile)
            for row in reader:
                print("Starting iteration")
                self.initialise_file_attributes(row=row)
                self.update_paths()
                self.update_file_tag()
                if not self.dry_run:
                    self.hk_api.commit()

def main(input_file: str, env: str, dry_run: bool):
    hk_api: HousekeeperAPI = get_hk_api(env=env)
    file_updater = FastqFileUpdater(
        input_file=input_file, hk_api=hk_api, env=env, dry_run=dry_run
    )
    file_updater.update_fastq_files()

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python 02_fix_paths_and_tags.py input_csv_file, env, dry_run")
    else:
        input_csv_file: str = sys.argv[1]
        env = sys.argv[2]
        dry_run = sys.argv[3].lower() not in ["0", "false"]
        print("======================== STARTING ======================")
        print(f"Setting environment to {env}")
        print(f"Setting dry run to {dry_run}")
        main(input_file=input_csv_file, env=env, dry_run=dry_run)
        print("================= FINISHED SUCCESSFULLY =================")
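
The third argument errs on the safe side: anything other than `0` or `false` (case-insensitive) enables the dry run. A minimal sketch of that rule as a standalone function, where `parse_dry_run` is a hypothetical name that does not appear in the script:

```python
def parse_dry_run(value: str) -> bool:
    """Return False only for '0' or 'false' (any case); anything else is a dry run."""
    return value.lower() not in {"0", "false"}
```

A typo such as `flase` therefore results in a dry run rather than an accidental rename.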

diitaz93 commented 1 year ago

UPDATE: It seems that the samples with a mismatch between the flow cell id in the path and in the tags are top-ups, whose tag carries the flow cell id of their previous flow cell.

henrikstranneheim commented 1 year ago

# Remove ^M
perl -i -p -e 's/\r//g' "file"

If you dare go old perlish style
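
A Python alternative, as a minimal sketch: it drops real carriage returns (displayed as `^M` by some tools) as well as a literal two-character `^M`, and then parses the rows. `clean_csv_text` is a hypothetical helper, and it assumes that `^M` never occurs legitimately inside a path:

```python
import csv


def clean_csv_text(text: str) -> list[list[str]]:
    """Drop carriage returns and literal '^M' markers, then parse the rows."""
    cleaned = text.replace("\r", "").replace("^M", "")
    # Empty lines (such as the one after a trailing newline) parse to [] and are skipped.
    return [row for row in csv.reader(cleaned.split("\n")) if row]
```

When reading from disk, opening the file with `open(path, newline="")` keeps Python from translating a lone carriage return into a line break before the text reaches this function.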

islean commented 1 year ago

So the current files don't have a single flow cell in their path? It seems to me that we are just prepending the "real" flow cell to the name. Should we not remove the old one?

islean commented 1 year ago

Looks good! Could it be the case that we need to add any of the Flow cell tags? Or do we know that they exist?

diitaz93 commented 1 year ago

So the current files don't have a single flow cell in their path? It seems to me that we are just prepending the "real" flow cell to the name. Should we not remove the old one?

There are two main groups: files that have a flow cell name in the path and files that don't. It turned out (and I forgot to mention) that when the path contains a flow cell name, it is the correct one; the wrong one is always the tag. For those cases, only the tag is updated.
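
The split between the two groups boils down to one check. A minimal sketch, assuming the same rule as update_paths in the script (prepend the flow cell id only when it is not already part of the path) and leaving out the environment substitution; `corrected_path` is a hypothetical helper:

```python
from __future__ import annotations

from pathlib import PurePosixPath


def corrected_path(original_path: str, real_flow_cell_id: str) -> str | None:
    """Return the new path, or None when the path already names the flow cell."""
    if real_flow_cell_id in original_path:
        return None  # Only the tag needs fixing for this file.
    path = PurePosixPath(original_path)
    return str(path.with_name(f"{real_flow_cell_id}_{path.name}"))
```

Since a path that already contains the correct flow cell is left alone, there is never an old flow cell to remove from the name.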

diitaz93 commented 1 year ago

The function self.hk_api.get_tag(name=self.real_flow_cell_id) checks if the tag exists and creates a new one if it doesn't :)
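
The get-or-create behaviour described here can be pictured with an in-memory stand-in. This is only an illustrative sketch: `TagRegistry` and this `Tag` class are hypothetical and do not reproduce the Housekeeper or cg implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tag:
    """Stand-in for a tag record; the real model lives in Housekeeper."""

    name: str


@dataclass
class TagRegistry:
    """Hypothetical in-memory store with get-or-create semantics."""

    _tags: dict[str, Tag] = field(default_factory=dict)

    def get_tag(self, name: str) -> Tag:
        """Return the existing tag, creating and storing it first if missing."""
        if name not in self._tags:
            self._tags[name] = Tag(name=name)
        return self._tags[name]
```

Calling `get_tag` twice with the same name returns the same object, so the lookup never produces duplicate tags.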

Vince-janv commented 1 year ago

If the conclusion is that the flow cell in the name is always correct, why do we have a function to update the paths?
If some category of flow cells in this list only appears once (like the NovaSeq X one) I think they can be dealt with manually instead of making the script more complex.

Logic looks solid though 👍

diitaz93 commented 1 year ago

If the conclusion is that the flow cell in the name is always correct, why do we have a function to update the paths?

In some cases the name does not contain a flow cell, in which case the flow cell id is prepended to the name. If the flow cell is already there, we skip that step.

If some category of flow cells in this list only appears once (like the NovaSeq X one) I think they can be dealt with manually instead of making the script more complex.

There is no distinction between flow cell types in this code, or which part are you referring to?

diitaz93 commented 11 months ago

Ran on stage and production without errors. Updated 258 files in production

Clinical-Genomics / housekeeper

Some fastq files have the wrong flow cell as tag or in their file path #178
