The tutorial states: "We notes that the gene_name mapping before exist some duplicates, we will process the duplicate indexes to retain only the highest expressed genes"
However, the code only keeps the first occurrence of each duplicated index (`keep='first'`). That retains the highest-expressed gene only if the data is already sorted in descending order by the sum of each row, which the code assumes but never does:
def data_drop_duplicates_index(data:pd.DataFrame)->pd.DataFrame:
    r"""
    Drop the duplicated index of data.

    Arguments:
        data: The data to be processed.

    Returns:
        data: The data after dropping the duplicated index.
    """
    index=data.index
    data=data.loc[~index.duplicated(keep='first')]
    return data
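# Reproduction, following the tutorial
import pandas as pd
import omicverse as ov

data = pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt',index_col=0,sep='\t',header=1)
# Strip the path and the .bam suffix from the sample column names
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
data.head()
# Map gene IDs to gene names; this step introduces duplicated index labels
data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv')
data
# Show how often each gene name occurs in the index
print(data.index.value_counts())
dds=ov.bulk.pyDEG(data)
x=dds.drop_duplicates_index()
print('... drop_duplicates_index success')
x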
Inspecting the intermediate data shows that the 7SK row kept in the tutorial is not the one with the highest expression level.
![image](https://github.com/Starlitnightly/omicverse/assets/72878144/d2e797c9-93d5-45c1-b8b1-0604ed286936)
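A possible fix, as a minimal sketch rather than the library's actual implementation: sort the rows by their row sum in descending order before dropping duplicates, so that `keep='first'` really does keep the highest-expressed row. The helper name `drop_duplicates_keep_highest` and the toy data are hypothetical and only for illustration; "highest expressed" is assumed here to mean the largest sum across samples.

```python
import numpy as np
import pandas as pd


def drop_duplicates_keep_highest(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: for each duplicated index label,
    keep the row with the largest row sum."""
    # Total expression per row (assumption: sum across all numeric columns).
    totals = data.sum(axis=1, numeric_only=True).to_numpy()
    # Positional order from highest to lowest total; positions are used
    # instead of labels because the index contains duplicates.
    order = np.argsort(-totals, kind="stable")
    sorted_data = data.iloc[order]
    # After sorting, the first occurrence of each label is the highest one.
    return sorted_data.loc[~sorted_data.index.duplicated(keep="first")]


# Toy data: the first "7SK" row is NOT the highest-expressed one.
toy = pd.DataFrame(
    {"s1": [1, 50, 7], "s2": [2, 60, 3]},
    index=["7SK", "7SK", "Actb"],
)

# Current behaviour: keeps whichever duplicate happens to come first.
first = toy.loc[~toy.index.duplicated(keep="first")]
# Sketch of the fix: keeps the duplicate with the largest row sum.
best = drop_duplicates_keep_highest(toy)

print(first)
print(best)
```

Note that this sketch returns the rows ordered by descending row sum rather than in their original order; if the original order matters, the result could be reindexed afterwards.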