Input of the groupNumtCluster_fromMultipleSamples.py

Thanks for sharing these very interesting scripts. I have a few questions. (1) The first question is about the input of groupNumtCluster_fromMultipleSamples.py. I am not sure which output file of searchNumtCluster_fromDiscordantReads.py should be used as the input to this script.

According to the description of groupNumtCluster_fromMultipleSamples.py ("This script takes the cluster sum files generated from searchNumtCluster_fromDiscordantReads.py to look for shared NUMT clusters across different individuals."), sample.mt.disc.sam.cluster.summary.tsv looks like the intended input file. However, according to the script code, groupNumtCluster_fromMultipleSamples.py operates mainly on four columns of its input: 'sampleID', 'chr', 'start' and 'end'. sample.mt.disc.sam.breakpointINPUT.tsv contains these four columns, so it looks more like the actual input file.

In addition, based on the code from searchNumtCluster_fromDiscordantReads.py shown below, do I need to split the 'Cluster_ID' column to obtain the specific columns ('sampleID', 'chr', 'start', 'end') required as input by groupNumtCluster_fromMultipleSamples.py?

df_pos = pd.DataFrame(output2['Cluster_ID'].str.split('_').tolist(),columns = ['chr','start','end','chrM','mtstart','mtend'])
del output2['Cluster_ID']
del output2['subCluster_No']
del output2['size']
output3 = pd.concat([output2, df_pos[['chr','start','end']]], axis=1)
output3 = output3.drop_duplicates(['chr','start','end'])
output3['start'] = output3['start'].astype(int) - 500
output3['end'] = output3['end'].astype(int) + 500 + 150
output3['Cluster_No'] = output3['Cluster_No'].astype(int)
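To make the question concrete, here is a minimal sketch of the splitting I have in mind. It assumes that each Cluster_ID has the form chr_start_end_chrM_mtstart_mtend (inferred from the column names in the code above) and that the sample ID is supplied separately, for example from the file name. The function names split_cluster_id and to_group_input and the example values are hypothetical and are not taken from the repository.

```python
import csv
import io


def split_cluster_id(cluster_id):
    """Split 'chr_start_end_chrM_mtstart_mtend' into its six fields.

    Assumption: the ID always ends with five '_'-separated fields after the
    nuclear contig name. Splitting from the right keeps contig names that
    themselves contain '_' in one piece.
    """
    chrom, start, end, mt_chrom, mt_start, mt_end = cluster_id.rsplit("_", 5)
    return chrom, int(start), int(end), mt_chrom, int(mt_start), int(mt_end)


def to_group_input(sample_id, cluster_ids):
    """Return unique (sampleID, chr, start, end) rows in first-seen order."""
    seen = set()
    rows = []
    for cluster_id in cluster_ids:
        chrom, start, end, *_ = split_cluster_id(cluster_id)
        key = (chrom, start, end)
        if key not in seen:
            seen.add(key)
            rows.append((sample_id, chrom, start, end))
    return rows


# Hypothetical cluster IDs for one sample; the second one is a duplicate.
rows = to_group_input(
    "sampleA",
    [
        "chr1_1000_1500_chrM_200_400",
        "chr1_1000_1500_chrM_200_400",
        "chr2_5000_5300_chrM_900_1100",
    ],
)

# Write the rows as a tab-separated table with the four expected columns.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["sampleID", "chr", "start", "end"])
writer.writerows(rows)
print(buffer.getvalue())
```

Unlike the excerpt above, this sketch does not pad the coordinates by 500 or 150 bp. I do not know whether the grouping script expects padded or unpadded positions.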

(2) Second question: the Methods section of the article states, 'The NUMTs within a distance of 1,000 bp on both nuclear DNA and mtDNA were grouped as the same NUMT.' However, I could not find any code in groupNumtCluster_fromMultipleSamples.py that merges NUMT clusters based on mtDNA position. How did you handle this step?
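For reference, this is roughly how I would interpret that sentence. It is only a sketch under my own assumptions, not the method used in the paper: two clusters are linked when they are on the same chromosome and their intervals lie within 1,000 bp of each other on both the nuclear and the mtDNA side, and linked clusters are merged transitively. The function names, the dictionary keys and the example coordinates are hypothetical, and the circularity of the mitochondrial genome is ignored.

```python
from itertools import combinations

MAX_DIST = 1000  # distance threshold in bp, taken from the Methods sentence


def gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def group_numts(clusters, max_dist=MAX_DIST):
    """Assign a group label to each cluster using single-linkage merging.

    Assumption: mtDNA coordinates are treated as linear, so clusters that
    span the origin of the circular genome are not handled specially.
    """
    parent = list(range(len(clusters)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Simple all-pairs comparison; fine for a sketch, quadratic in input size.
    for i, j in combinations(range(len(clusters)), 2):
        a, b = clusters[i], clusters[j]
        close_nuclear = (
            a["chr"] == b["chr"]
            and gap(a["start"], a["end"], b["start"], b["end"]) <= max_dist
        )
        close_mt = gap(a["mtstart"], a["mtend"], b["mtstart"], b["mtend"]) <= max_dist
        if close_nuclear and close_mt:
            parent[find(i)] = find(j)

    # Relabel the union-find roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(clusters))]


# Hypothetical clusters pooled from several samples.
clusters = [
    {"chr": "chr1", "start": 1000, "end": 1500, "mtstart": 200, "mtend": 400},
    {"chr": "chr1", "start": 1800, "end": 2100, "mtstart": 350, "mtend": 600},
    {"chr": "chr1", "start": 1200, "end": 1600, "mtstart": 9000, "mtend": 9300},
    {"chr": "chr2", "start": 1000, "end": 1500, "mtstart": 200, "mtend": 400},
]
print(group_numts(clusters))
```

In this example the first two clusters fall into one group, while the third stays separate because its mtDNA position is far away even though its nuclear position overlaps. Is this the behaviour the sentence in the Methods describes?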

(3) Third question: is the NUMT classification (Common, Rare, Ultra-rare, Private) calculated from the Cluster_ID column in allsamples.mt.disc.sam.cluster.summary.tsv, the file obtained by merging all samples?

Thank you very much!

WeiWei060512 / NUMTs-detection
