ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
523 stars 111 forks source link

halLiftover gives wrong strand information #1162

Open WagnerNils opened 1 year ago

WagnerNils commented 1 year ago

Hi, I want to liftover from human to any species in the zoonomia halfile. The liftover produces different results for the strand information depending on whether the '--outPSL' flag is used or not. E.g. when I perform a dummy liftover from human to human from the minus strand, the PSL output produces results on the plus strand. Without the '--outPSL' flag the minus strand is correctly reported. Here is a simple example to reproduce this:

Input bed file: chrom Start End strand
chr19 58352098 58352099 -
halLiftover = './cactus/cactus-bin-v2.5.2/bin/halLiftover'
halfile = '605-vertebrate-2020.hal' #same for '241-mammalian-2020v2.hal'
human_bed = path_to_above_bed_file
target_species = 'Homo_sapiens'
target_species_bed = path_to_liftover_bed_file
target_species_psl = path_to_liftover_psl_file

Output (bed):

pd.read_csv(target_species_bed, sep='\t', names=['tName', 'tStart', 'tEnd', 'tStrand'])
tName tStart tEnd tStrand
chr19 58352098 58352099 -
f'{halLiftover} --outPSL {halfile} Homo_sapiens {human_bed} {target_species} {target_species_psl}'
def read_psl(path):
    column_names = [
        'match', #Number of bases that match that aren't repeats.
        'mismatch', #Number of bases that don't match.
        'repMatch', #Number of bases that match but are part of repeats.
        'nCount', #Number of 'N' bases.
        'qNumInsert', #Number of inserts in query.
        'qBaseInsert', #Number of bases inserted into query.
        'tNumInsert', #Number of inserts in target.
        'tBaseInsert', #Number of bases inserted into target.
        'strand', #'+' or '-' for query strand. For translated alignments, second '+' or '-' is for genomic strand.
        'qName', #Query sequence name.
        'qSize', #Query sequence size.
        'qStart', #Alignment start position in query.
        'qEnd', #Alignment end position in query.
        'tName', #Target sequence name.
        'tSize', #Target sequence size.
        'tStart', #Alignment start position in target.
        'tEnd', #Alignment end position in target.
        'blockCount', #Number of blocks in the alignment (a block is a contiguous matched sequence).
        'blockSizes', #Comma-separated list of sizes of each block.
        'qStarts', #Comma-separated list of start position of each block in query.
        'tStarts', #Comma-separated list of start position of each block in target.
    ]
return pd.read_csv(path, sep="\t", names=column_names)
read_psl(target_species_psl)[['strand', 'qName', 'qStart', 'qEnd', 'tName',  'tStart', 'tEnd']]
Output (PSL): strand qName qStart qEnd tName tStart tEnd
++ chr19 58352098 58352099 chr19 58352098 58352099
WagnerNils commented 1 year ago

Continuing on this; I tried to use the strand information from the bed output (not using 'outPSL' flag). However, when lifting to other species the bed file output seems to be sometimes wrong and the PSL output correct (when comparing to what is displayed in USCS browser).

Input bed file: chrom Start End strand
chr10 237585 237586 +
halLiftover = './cactus/cactus-bin-v2.5.2/bin/halLiftover'
halfile = '605-vertebrate-2020.hal' #same for '241-mammalian-2020v2.hal'
human_bed = path_to_above_bed_file
target_species = 'Ateles_geoffroyi'
target_species_bed = path_to_liftover_bed_file
target_species_psl = path_to_liftover_psl_file

Output (bed):

pd.read_csv(target_species_bed, sep='\t', names=['tName', 'tStart', 'tEnd', 'tStrand'])
tName tStart tEnd tStrand
AteGeo_scaffold_8064 13504 13505 +

What is the best way to use the liftover functionality in cactus?

f'{halLiftover} --outPSL {halfile} Homo_sapiens {human_bed} {target_species} {target_species_psl}'
read_psl(target_species_psl)[['strand', 'qName', 'qStart', 'qEnd', 'tName',  'tStart', 'tEnd']]
Output (PSL): strand qName qStart qEnd tName tStart tEnd
+- chr10 237585 237586 AteGeo_scaffold_8064 13504 13505

Double checking with the USCS browser for the target species the minus strand seems to be proper for this genomic region: image