EUCI: Unannotated clusters

Mass23 / NOMIS_ENSEMBLE


EUCI: Unannotated clusters #10

Closed susheelbhanu closed 3 years ago

susheelbhanu commented 3 years ago

Purpose: Generate statistics for the unannotated clusters,
in order to distinguish clusters that are truly "known unknowns" from those that are "unknown unknowns".

Todo:

[x] BLAST
[x] tRNA
[x] Cluster identity
[x] Cluster GC
[x] Coverage

susheelbhanu commented 3 years ago

[x] (1) load cluster fasta
[x] (2) mean length of the sequences
[x] (3) std of the length of the sequences
[x] (4) Mean GC content of the sequences
[x] (5) Std GC content of the sequences
[x] (6) Mafft alignment
[x] (7) Similarity of the aligned sequences
[x] (8) EMBOSS cons consensus sequence creation
[x] (9) Blast of the consensus on nr (retrieve a few metrics)
[x] (10) tRNA check
[x] (11) tRNA amino acid
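For reference, steps (1) to (5) and (7) could be sketched in plain Python roughly as below. This is a minimal illustration, not the code used in the pipeline: the function names are hypothetical, the standard deviation is the population one, and "similarity" is assumed to mean the mean pairwise identity over alignment columns where neither sequence has a gap. The sequences are assumed to be nucleotide FASTA.

```python
import math
from itertools import combinations


def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dict."""
    records, header = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def mean_std(values):
    """Return (mean, population standard deviation) of a non-empty list."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def gc_fraction(seq):
    """Fraction of G/C among A/C/G/T characters (gaps and N are ignored)."""
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def cluster_stats(records):
    """Summarise sequence length and GC content for one cluster."""
    seqs = [seq.replace("-", "") for seq in records.values()]
    length_mean, length_std = mean_std([len(seq) for seq in seqs])
    gc_mean, gc_std = mean_std([gc_fraction(seq) for seq in seqs])
    return {"length_mean": length_mean, "length_std": length_std,
            "gc_mean": gc_mean, "gc_std": gc_std}


def pairwise_identity(a, b):
    """Identity over aligned columns where neither sequence has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)


def mean_pairwise_identity(aligned):
    """Average pairwise identity over all sequence pairs of an alignment."""
    pairs = list(combinations(aligned.values(), 2))
    if not pairs:
        return 1.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy records for illustration only; not taken from the real clusters.
toy = read_fasta([">s1", "ACGT", ">s2", "GGCCGG"])
print(cluster_stats(toy))
```

In practice `read_fasta` would be given an open file handle for the cluster FASTA, and `mean_pairwise_identity` the records parsed from the Mafft output.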

susheelbhanu commented 3 years ago

Working directory: /mnt/md1200/epfl_sber/massimo/EUCI_MG/selection_inference/sbusi/clusters

susheelbhanu commented 3 years ago

Using the file clusters_min_9_seq_2_samp.tsv to generate the cluster_list as follows:

cd /mnt/md1200/epfl_sber/massimo/EUCI_MG/selection_inference/sbusi/clusters
python
import pandas as pd

# Load the cluster table and drop its first column
df = pd.read_csv("clusters_min_9_seq_2_samp.tsv", sep='\t')
df = df.drop(columns=df.columns[0])

# Write the 'Sequence' column of each cluster to <ClusterID>.txt
for cluster_id, group in df.groupby('ClusterID'):
    group[['Sequence']].to_csv(str(cluster_id) + '.txt', sep='\t', header=False, index=False)

rename "Cluster" "Cluster_" *.txt
ls -1 *.txt | sed 's/^Cluster_//' | sed 's/\.txt$//' > cluster_list
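As an aside, the same grouping and cluster list could be derived with only the Python standard library. The sketch below is an assumption-laden alternative, not what was run here: `group_sequences` and `cluster_list` are hypothetical helper names, the cluster IDs are assumed to start with "Cluster" (as the rename step suggests), and the list is sorted lexically, which may differ from the `ls` ordering.

```python
import csv
import io
from collections import defaultdict


def group_sequences(tsv_handle, id_col="ClusterID", seq_col="Sequence"):
    """Map each cluster ID to its list of sequence names, in file order."""
    groups = defaultdict(list)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        groups[row[id_col]].append(row[seq_col])
    return dict(groups)


def cluster_list(groups, prefix="Cluster"):
    """Return cluster identifiers with the assumed prefix removed, sorted."""
    stripped = []
    for cluster_id in groups:
        if cluster_id.startswith(prefix):
            cluster_id = cluster_id[len(prefix):].lstrip("_")
        stripped.append(cluster_id)
    return sorted(stripped)


# Toy table for illustration only; the real input is the TSV named above.
toy_tsv = "\tClusterID\tSequence\n0\tCluster1\tseqA\n1\tCluster2\tseqB\n2\tCluster1\tseqC\n"
groups = group_sequences(io.StringIO(toy_tsv))
print(cluster_list(groups))
```

With the real data, `group_sequences` would be called on the opened clusters_min_9_seq_2_samp.tsv, and each value in `groups` written to its own per-cluster file.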

susheelbhanu commented 3 years ago

Run started at 2020-02-16:15:45:00

susheelbhanu commented 3 years ago

The components still missing from here will be addressed outside of Snakemake, as downstream analyses.