Closed frankligy closed 2 years ago
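import pandas as pd

# Load the Ensembl 91 GTF table (tab-separated; expects 'feature', 'start' and 'attribute' columns)
gtf = pd.read_csv('/Users/ligk2e/Desktop/gtfEnsembl91.txt', sep='\t')
# Keep only start_codon rows; copy so that adding a column does not raise SettingWithCopyWarning
gtf_sc = gtf.loc[gtf['feature'] == 'start_codon', :].copy()
# Assumes the first attribute field is gene_id "ENSG..."; extract the bare gene ID
gtf_sc['gene'] = [item[0].split(' ')[1].strip('"') for item in gtf_sc['attribute'].str.split(';')]
# Collect the unique start-codon positions of each gene
dic = {}
for gene, sub in gtf_sc.groupby(by='gene'):
    dic[gene] = list(sub['start'].unique())
sc = pd.Series(data=dic, name='start_codon').to_frame()
# For genes with several start codons, keep the first position seen per reading frame (start % 3)
col = []
for lis in sc['start_codon']:
    if len(lis) == 1:
        col.append(lis)
    else:
        remainder = [item % 3 for item in lis]
        col.append(pd.Series(index=lis, data=remainder).drop_duplicates().index.tolist())
sc['non_redundant'] = col
sc.to_csv('/Users/ligk2e/Desktop/df_start_codon.txt', sep='\t')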
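To make the deduplication step concrete, here is a minimal sketch on made-up coordinates. The positions and the helper name `one_start_per_frame` are hypothetical and do not come from the real GTF. It keeps the first start position seen for each value of `start % 3`, mirroring the `drop_duplicates` call in the script above.

```python
import pandas as pd

def one_start_per_frame(starts):
    """Keep the first start position seen for each reading frame (start % 3)."""
    if len(starts) == 1:
        return list(starts)
    # Index = start position, value = its frame; drop_duplicates keeps the first row per frame
    frames = pd.Series(index=starts, data=[s % 3 for s in starts])
    return frames.drop_duplicates().index.tolist()

# Hypothetical start codons of one gene: frames are 1, 1, 2, 2, 0
toy_starts = [100, 103, 104, 107, 201]
print(one_start_per_frame(toy_starts))  # [100, 104, 201]
```

Note that which position survives within a frame depends only on the order of the input list, so sorting the starts beforehand would change the representative that is kept.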
Although I use three-frame in-silico translation, a protein usually uses a single ORF, so we want to annotate each obtained T antigen with its reading frame in order to prioritize certain neoantigens for experimental validation.
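One possible way to do this annotation is sketched below. It is not the method used in the project, only an illustration under stated assumptions: the helper `in_annotated_frame`, the peptide names, and all positions are hypothetical, and both the ORF start and the start codons are assumed to be in the same intron-free coordinate system (for example transcript coordinates) on the same strand. A candidate is flagged when its offset from any annotated start codon is a multiple of 3.

```python
def in_annotated_frame(orf_start, start_codons):
    """Return True if orf_start shares a reading frame with any annotated start codon.

    Assumes all positions are in the same intron-free coordinate system and strand.
    """
    return any((orf_start - codon) % 3 == 0 for codon in start_codons)

# Hypothetical non-redundant start codons for one gene (frames 1 and 2)
annotated = [100, 104]
# Hypothetical translation start positions of three candidate peptides
candidates = {'pep_A': 130, 'pep_B': 132, 'pep_C': 134}

labels = {name: in_annotated_frame(pos, annotated) for name, pos in candidates.items()}
print(labels)  # {'pep_A': True, 'pep_B': False, 'pep_C': True}
```

Candidates labeled `True` would be the ones to rank higher for validation. With genomic coordinates, introns between the start codon and the peptide would break this simple modulo check.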