ammaraziz / flukit

GNU Lesser General Public License v3.0
2 stars 2 forks source link

subcommand split #8

Open ammaraziz opened 1 year ago

ammaraziz commented 1 year ago

New subcommand: split

split subcommand takes in a mutifasta file and splits into either individual or gene. No renaming is performed. Write protection is needed (never overwrite a file).

Usage:

flukit split -i input.fasta -o output_dir --split-by gene

The function write_sequences performs this operation currently, however there is no subcommand to handle splitting individually.

Output should be a directory and the function should NOT overwrite existing files. It should through a warning via print to the user.

write_sequences for ind needs to be changed to handle a dict or a list.

ammaraziz commented 1 year ago

To do for Arada:

  1. Clone the repo or pull latest
  2. Create new branch with appropriate name
  3. Create a new subcommand in flukit, see flukit.find as an example
  4. Write the internals
ammaraziz commented 1 year ago

There's a need to split by subtype. Current method of splitting fasta is done when writing out. This in hindsight is hacky. A better way to do this is to filter and split the meta data into lists of dataframes and then for each dataframe, create a corresponding sequences to write out.

aradahir commented 1 year ago
from Bio import SeqIO
import pandas as pd
import os

def subtype_categories(meta):
    gene = {
            '.1': 'pb2', 
            '.2': 'pb1', 
            '.3': 'pa' , 
            '.4': 'ha' , 
            '.5': 'np' , 
            '.6': 'na' , 
            '.7': 'm'  , 
            '.8': 'ns'
        }
    return meta['Seq No'].str[-2:].map(gene)

#input
meta_file = 'renamer2.csv'
fasta_file = 'multi.fasta'

def meta_split(meta_file, fasta_file):

    meta = pd.read_csv(meta_file)
    grouping = meta.groupby(
                    ["Sub Type",
                    "Passage History", 
                    subtype_categories(meta, gene)], group_keys=True)

    sub_files = {}
    for head, frame in grouping:
        sub_files.update({str(head[0]+'_'+head[1]+'_'+head[2]): list(frame['Seq No'])})

    path = os.getcwd()
    final_directory = os.path.join(path, r'files_split')
    if not os.path.exists(final_directory):
        os.makedirs(final_directory)

    for file in sub_files.items():
        sequences = []
                for seq in SeqIO.parse(fasta_file, "fasta"):
                      if seq.id in file[1]:
                          sequences.append(seq)
                SeqIO.write(sequences, str(final_directory+'/'+file[0]+'.fasta'), "fasta")