Open olgabot opened 4 years ago
Here's some starter code:
import glob
import os

import pandas as pd
from tqdm import tqdm

# `outdir` is assumed to be defined earlier as the pipeline's output directory
filenames = glob.iglob(f'{outdir}/diamond/blastp/*.tsv')

DIAMOND_BLASTP_COLUMNS = [
    'read_id', 'subject_id', 'percent_identity', 'e_value', 'bitscore',
    'subject_title', 'subject_taxid', 'subject_species', 'subject_kingdom',
    'subject_superkingdom', 'subject_phylum',
]


def read_diamond_blastp_output(filename):
    # The TSV files have no header row, so supply the column names
    return pd.read_csv(filename, sep='\t', names=DIAMOND_BLASTP_COLUMNS)


dfs = []
n = 0
for filename in tqdm(filenames):
    # Skip empty files, which cannot be parsed and contain no hits
    if os.path.getsize(filename) > 0:
        n += 1
        # The region ID is the part of the file name before the first '__'
        utar_id = os.path.basename(filename).split('__')[0]
        df = read_diamond_blastp_output(filename)
        df['region_id'] = utar_id
        dfs.append(df)

print(f'Number of regions with homology: {n}')
diamond_results = pd.concat(dfs)
print(diamond_results.shape)
diamond_results.head()
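The region ID step in the loop above can be checked on its own. This is a minimal sketch; the helper name `region_id_from_path` and the example path are hypothetical, since the real file naming scheme is not shown in this issue beyond the `__` separator.

```python
import os


def region_id_from_path(path):
    # Everything before the first "__" in the file name is taken as the region ID
    return os.path.basename(path).split('__')[0]


# Hypothetical path, made up for illustration only
example = 'results/diamond/blastp/region42__extra__info.tsv'
region_id = region_id_from_path(example)
print(region_id)
```

If a file name has no `__` at all, the whole base name (including the `.tsv` extension) would come back as the ID, so it may be worth checking for that case.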
Ignore the NP_01232 and species part of the description:
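# Raw string so the backslashes reach the regex engine unchanged;
# the dot before the accession version is escaped to match a literal '.'
pattern = r'\d+\.\d(.+)\[[\w ]+\]'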
diamond_results.subject_title.head().str.extract(pattern)
Add columns for just the subject description, and for whether it's only a predicted gene or a "real" gene:
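# expand=False returns a Series (one capture group) instead of a one-column DataFrame
diamond_results['subject_description_with_predicted'] = diamond_results.subject_title.str.extract(pattern, expand=False)
diamond_results['is_predicted'] = diamond_results.subject_title.str.contains('PREDICTED')
# Strip surrounding whitespace so predicted and non-predicted descriptions compare equal
diamond_results['subject_description'] = diamond_results.subject_description_with_predicted.str.split('PREDICTED: ').str[-1].str.strip()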
diamond_results.head()
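To see what the pattern captures, here is a small sketch that runs it on single strings with the standard `re` module (pandas `str.extract` returns the first match in the same way). The two titles and the helper name `describe` are made up for illustration. The sketch also strips whitespace, because the captured group keeps the spaces around the description.

```python
import re

# Same idea as the pattern above: skip "<digits>.<digit>" of the accession,
# capture the description, stop before the trailing "[species]" tag
pattern = r'\d+\.\d(.+)\[[\w ]+\]'

# Hypothetical subject titles, made up for illustration only
titles = [
    'NP_01232.1 zinc finger protein 1 [Homo sapiens]',
    'XP_00001.2 PREDICTED: zinc finger protein 1 [Mus musculus]',
]


def describe(title):
    match = re.search(pattern, title)
    if match is None:
        return None, False
    with_predicted = match.group(1).strip()
    is_predicted = 'PREDICTED' in title
    # Keep only the text after the "PREDICTED: " prefix, if present
    description = with_predicted.split('PREDICTED: ')[-1]
    return description, is_predicted


results = [describe(t) for t in titles]
print(results)
```

Without the strip, the non-predicted description would keep a leading space while the predicted one would not, so the two would not group together downstream.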
Then pick the most likely sequence per region, as described here: https://github.com/czbiohub/nf-predictorthologs/issues/61, with the most_likely_sequence function already defined:
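# Smaller e-values are better, so weight each hit by 1 / e_value
# (an e_value of exactly 0 becomes an infinite weight)
diamond_results['e_value_inverse'] = 1 / diamond_results['e_value']
diamond_predictions = diamond_results.groupby('region_id').apply(
    lambda x: most_likely_sequence(
        x, id_col='subject_description', weight_col='e_value_inverse'
    )
)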
print(diamond_predictions.shape)
diamond_predictions.head()
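The `most_likely_sequence` function itself is not shown in this issue. As a rough idea of what a weighted pick could look like, here is a sketch that sums the weights per description and returns the description with the largest total. This is an assumption about the behavior, not the function from issue #61; the names `weighted_vote` and `most_likely_sequence_sketch` and the example hits are hypothetical.

```python
import math
from collections import defaultdict


def weighted_vote(ids, weights):
    """Return the id with the largest summed weight, or None if there is none."""
    totals = defaultdict(float)
    for item, weight in zip(ids, weights):
        # Skip hits with no description (None or NaN)
        if item is None or (isinstance(item, float) and math.isnan(item)):
            continue
        totals[item] += weight
    if not totals:
        return None
    return max(totals, key=totals.get)


def most_likely_sequence_sketch(hits, id_col, weight_col):
    # `hits` only needs column access by name, so one pandas groupby group
    # or a plain dict of lists both work here
    return weighted_vote(hits[id_col], hits[weight_col])


# Made-up hits for a single region
hits = {
    'subject_description': ['kinase A', 'kinase B', 'kinase A'],
    'e_value_inverse': [1 / 1e-5, 1 / 1e-6, 1 / 1e-7],
}
best = most_likely_sequence_sketch(hits, 'subject_description', 'e_value_inverse')
print(best)
```

Note that a hit with an e-value of 0 gets an infinite weight and would dominate the vote, so capping the weights or using the bitscore instead might be worth considering.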
For each query item, summarize the highest matching protein sequences as output by DIAMOND.
The query item differs depending on the input:
Summarize the DIAMOND blast output, which looks like this:
This summarizer could filter for NP-only sequences (https://github.com/czbiohub/nf-predictorthologs/issues/28)
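A possible shape for that filter, assuming "NP-only" means RefSeq protein accessions that start with the `NP_` prefix (as opposed to `XP_` model proteins) and that `subject_id` holds the bare accession. The helper name `is_np_accession` and the example accessions are hypothetical.

```python
def is_np_accession(subject_id):
    """True when the accession starts with the RefSeq 'NP_' prefix."""
    return isinstance(subject_id, str) and subject_id.startswith('NP_')


# Made-up accessions for illustration only
subject_ids = ['NP_01232.1', 'XP_00001.2', 'WP_00002.1', None]
np_only = [s for s in subject_ids if is_np_accession(s)]
print(np_only)
```

On the DataFrame from the starter code, this could be applied as `diamond_results[diamond_results.subject_id.map(is_np_accession)]` before grouping by region.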