CFIA-NCFAD / nf-flu

Influenza genome analysis Nextflow workflow
MIT License

[BUG]: nf-flu improperly labels segment ID for PB1 and PB2 for Influenza B #65

Closed Codes1985 closed 8 months ago

Codes1985 commented 8 months ago


Description of the Bug/Issue

Hello!

We were in the process of preparing an upload of Influenza B sequences to GISAID, when we realized that nf-flu was incorrectly labelling PB1 as PB2 and PB2 as PB1.

As you know, the segment number is assigned based on segment length, where segment 1 is the longest segment and segment 8 the shortest. For FluA, PB2 is the longest segment and is assigned segment 1, while PB1 is the next longest and is assigned segment 2. It turns out that for FluB the order is reversed: PB1 is the longest segment (segment 1), followed by PB2 (segment 2).

I noticed on line 32 in IRMA's init.sh script that this is accounted for: SEG_NUMBERS="B_PB1:1,B_PB2:2,A_PB2:1,A_PB1:2,PA:3,HA:4,NP:5,NA:6,M:7,NS:8"
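To make the IRMA mapping easier to read, here is a minimal sketch that parses the SEG_NUMBERS string quoted above into a lookup table. The helper name parse_seg_numbers is hypothetical and is not part of IRMA or nf-flu; only the string itself comes from init.sh.

```python
from typing import Dict

# IRMA's SEG_NUMBERS value, copied verbatim from the line quoted above.
SEG_NUMBERS = "B_PB1:1,B_PB2:2,A_PB2:1,A_PB1:2,PA:3,HA:4,NP:5,NA:6,M:7,NS:8"


def parse_seg_numbers(spec: str) -> Dict[str, int]:
    """Turn 'NAME:NUMBER,...' into {'NAME': NUMBER, ...} (hypothetical helper)."""
    result: Dict[str, int] = {}
    for item in spec.split(","):
        name, number = item.split(":")
        result[name] = int(number)
    return result


seg_numbers = parse_seg_numbers(SEG_NUMBERS)

# FluB PB1 and FluA PB2 both receive segment number 1.
print(seg_numbers["B_PB1"], seg_numbers["A_PB2"])
```

The A_ and B_ prefixes are what let IRMA give segment number 1 to a different polymerase gene depending on the virus type.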

If my understanding of how nf-flu works is correct, the segment number is being appended by IRMA, while the segment ID is being applied by nf-flu based off IRMA's annotation in parse_influenza_blast_results.py:

Lines 29-38:

SEGMENT_NAMES = {
    1: "1_PB2",
    2: "2_PB1",
    3: "3_PA",
    4: "4_HA",
    5: "5_NP",
    6: "6_NA",
    7: "7_M",
    8: "8_NS",
}

and lines 481-484:

df_all_blast_pandas: pd.DataFrame = df_all_blast.to_pandas()
# Convert segment number to segment name (1 -> "1_PB2")
df_all_blast_pandas["sample_segment"] = df_all_blast_pandas["sample_segment"]. \
    apply(lambda x: SEGMENT_NAMES[int(x)])

And since IRMA has appended "1" to the FluB PB1 sequence and "2" to the FluB PB2 sequence, the PB1 sequences are being renamed to "1_PB2" and the FluB PB2 sequences to "2_PB1".
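The mislabelling can be reproduced without pandas. The sketch below reuses the SEGMENT_NAMES dict from the script and adds one possible genus-aware lookup. SEGMENT_NAMES_B and segment_name are hypothetical names used only for illustration, and keying on the genus string is an assumption; this is not necessarily how nf-flu handles it.

```python
# Single lookup quoted from parse_influenza_blast_results.py: correct for Influenza A only.
SEGMENT_NAMES = {
    1: "1_PB2",
    2: "2_PB1",
    3: "3_PA",
    4: "4_HA",
    5: "5_NP",
    6: "6_NA",
    7: "7_M",
    8: "8_NS",
}

# Hypothetical Influenza B variant: PB1 and PB2 swap places, the rest is unchanged.
SEGMENT_NAMES_B = {**SEGMENT_NAMES, 1: "1_PB1", 2: "2_PB2"}


def segment_name(segment_number, genus="Alphainfluenzavirus"):
    """Pick the name table by genus (assumed key), then look up the segment number."""
    names = SEGMENT_NAMES_B if genus == "Betainfluenzavirus" else SEGMENT_NAMES
    return names[int(segment_number)]


# IRMA numbers the FluB PB1 segment "1"; the single table calls it PB2.
buggy = SEGMENT_NAMES[int("1")]
fixed = segment_name("1", genus="Betainfluenzavirus")
print(buggy, fixed)
```

With the single table, a FluB segment numbered 1 comes out as "1_PB2"; with the genus-aware lookup it comes out as "1_PB1".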

Thank you!

Nextflow command-line

nextflow run cfia-ncfad-nf-flu-3.3.6/workflow/main.nf --input samplesheet.csv --platform nanopore --low_coverage 50 --clair3_user_variant_model /rerio/clair3_models/r1041_e82_400bps_hac_v420/ --outdir <OUTDIR> --major_allele_fraction 0.50 -profile singularity,slurm

Error Message

No error message for this issue.

Workflow Version

3.3.6, revision: e2872b8

Nextflow Executor

slurm

Nextflow Version

22.10.0

Java Version

No response

Hardware

HPC Cluster

Operating System (OS)

Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Codename: Core

Conda/Container Engine

Singularity

Additional context

No response

peterk87 commented 8 months ago

Nice catch @Codes1985! I'll work on a fix.

peterk87 commented 8 months ago

This issue should be fixed in 3.3.7 with #66.

BLAST report and all derived results should show proper segment number and name for IBV:

| Sample | Sample Genome Segment Number | Reference NCBI Accession | Reference Subtype | Genus |
| SRR25375797 | 1_PB1 | OQ998010.1 | | Betainfluenzavirus |
| SRR25375797 | 2_PB2 | OR052894.1 | | Betainfluenzavirus |
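For anyone who wants to check results produced before 3.3.7, here is a rough sketch of a sanity check over report rows. The tuple layout (sample, segment label, genus) and the function name find_mislabelled_ibv are assumptions made for illustration; the actual report has more columns and would need to be read from its file first.

```python
# Labels that indicate the Influenza A ordering was applied to an Influenza B sample.
SWAPPED_IBV_LABELS = {"1_PB2", "2_PB1"}


def find_mislabelled_ibv(rows):
    """Return rows whose genus is Betainfluenzavirus but whose label uses FluA ordering.

    rows: iterable of (sample, segment_label, genus) tuples (assumed layout).
    """
    return [
        row
        for row in rows
        if row[2] == "Betainfluenzavirus" and row[1] in SWAPPED_IBV_LABELS
    ]


# The two rows shown in the report excerpt above.
rows = [
    ("SRR25375797", "1_PB1", "Betainfluenzavirus"),
    ("SRR25375797", "2_PB2", "Betainfluenzavirus"),
]
print(find_mislabelled_ibv(rows))
```

An empty result means no Influenza B row carries the swapped labels; any row returned would need its PB1/PB2 label corrected before submission.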

Thanks again @Codes1985 for catching and reporting the issue! Hopefully it didn't cause too much trouble with your submissions to NCBI! Please let me know if you have any other issues.

Codes1985 commented 8 months ago

Thank you so much for fixing this so quickly, @peterk87! Yeah, not a big deal since we don't have too many FluB samples. I was basically last week years old when I discovered the segment numbering was different between FluA and FluB. I better turn in my Flu card! 😆 Thanks again and take care!