Starlitnightly / omicverse

A Python library for multi-omics analysis, including bulk, single-cell, and spatial RNA-seq.
https://starlitnightly.github.io/omicverse/
GNU General Public License v3.0
277 stars 32 forks

Duplicate handling does not actually retain only the highest expressed gene #45

Closed. user-tq closed this issue 3 days ago

user-tq commented 6 months ago

The tutorial states: "We notes that the gene_name mapping before exist some duplicates, we will process the duplicate indexes to retain only the highest expressed genes". However, the code only keeps the first occurrence of each duplicated index. That retains the highest-expressed gene only if the data is already sorted in descending order by the sum of each row, which the code assumes but does not enforce.

def data_drop_duplicates_index(data:pd.DataFrame)->pd.DataFrame:
    r"""
    Drop the duplicated index of data.

    Arguments:
        data: The data to be processed.

    Returns:
        data: The data after dropping the duplicated index.
    """
    index=data.index
    data=data.loc[~index.duplicated(keep='first')]
    return data

# Reproduction with the tutorial data
import pandas as pd
import omicverse as ov

data = pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt',index_col=0,sep='\t',header=1)
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
data.head()

data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv') 
data 
print(data.index.value_counts())
dds=ov.bulk.pyDEG(data)
x=dds.drop_duplicates_index()
print('... drop_duplicates_index success')
x

Inspecting the intermediate data shows that the 7SK row retained in the tutorial is not the one with the highest expression level.
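
To make the difference concrete, here is a minimal sketch on toy data of one way to get the behaviour the tutorial describes: order the rows by their total expression (row sum) in descending order, then keep the first occurrence of each index label. The function name `drop_duplicates_keep_highest` and the toy values are my own assumptions for illustration. This is not necessarily how omicverse implements it.

```python
import pandas as pd


def drop_duplicates_keep_highest(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: for each duplicated index label, keep the row
    with the largest row sum."""
    # Total expression of each row across all samples.
    row_sums = data.sum(axis=1).to_numpy()
    # Positional order from highest to lowest row sum. Positions are used
    # instead of labels because the index contains duplicates.
    order = (-row_sums).argsort(kind="stable")
    sorted_data = data.iloc[order]
    # After sorting, the first occurrence of a label is its highest-expressed row.
    return sorted_data.loc[~sorted_data.index.duplicated(keep="first")]


# Toy data (made up): "7SK" appears twice, and the first row is the lower one.
toy = pd.DataFrame(
    {"s1": [1, 50, 7], "s2": [2, 60, 8]},
    index=["7SK", "7SK", "Actb"],
)

# Behaviour reported in this issue: the first occurrence wins.
first_kept = toy.loc[~toy.index.duplicated(keep="first")]
print(first_kept.loc["7SK"].sum())    # 3

# Sorting by row sum first keeps the highest-expressed duplicate.
highest_kept = drop_duplicates_keep_highest(toy)
print(highest_kept.loc["7SK"].sum())  # 110
```

Note that this sketch returns the rows ordered by row sum rather than in their original order. If the original order matters, the result would need to be re-sorted afterwards.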

Starlitnightly commented 6 months ago

Thanks for the report. This bug will be fixed in the next version.

Starlitnightly commented 3 days ago

We have fixed this error in 1.6.4.