Open ammaraziz opened 1 year ago
To do for Arada:
flukit.find
as an exampleThere's a need to split by subtype
. Current method of splitting fasta is done when writing out. This in hindsight is hacky. A better way to do this is to filter and split the meta data into lists of dataframes and then for each dataframe, create a corresponding sequences to write out.
meta_split
which takes a pandas dataframe and filters by:
write_sequences
to handle metadata alongside sequencesfrom Bio import SeqIO
import pandas as pd
import os
def subtype_categories(meta):
gene = {
'.1': 'pb2',
'.2': 'pb1',
'.3': 'pa' ,
'.4': 'ha' ,
'.5': 'np' ,
'.6': 'na' ,
'.7': 'm' ,
'.8': 'ns'
}
return meta['Seq No'].str[-2:].map(gene)
#input
meta_file = 'renamer2.csv'
fasta_file = 'multi.fasta'
def meta_split(meta_file, fasta_file):
meta = pd.read_csv(meta_file)
grouping = meta.groupby(
["Sub Type",
"Passage History",
subtype_categories(meta, gene)], group_keys=True)
sub_files = {}
for head, frame in grouping:
sub_files.update({str(head[0]+'_'+head[1]+'_'+head[2]): list(frame['Seq No'])})
path = os.getcwd()
final_directory = os.path.join(path, r'files_split')
if not os.path.exists(final_directory):
os.makedirs(final_directory)
for file in sub_files.items():
sequences = []
for seq in SeqIO.parse(fasta_file, "fasta"):
if seq.id in file[1]:
sequences.append(seq)
SeqIO.write(sequences, str(final_directory+'/'+file[0]+'.fasta'), "fasta")
New subcommand:
split
split
subcommand takes in a mutifasta file and splits into either individual or gene. No renaming is performed. Write protection is needed (never overwrite a file).Usage:
The function
write_sequences
performs this operation currently, however there is no subcommand to handle splitting individually.Output should be a directory and the function should NOT overwrite existing files. It should through a warning via print to the user.
write_sequences
forind
needs to be changed to handle a dict or a list.