davidaknowles / leafcutter

Annotation-free quantification of RNA splicing. Yang I. Li, David A. Knowles, Jack Humphrey, Alvaro N. Barbeira, Scott P. Dickinson, Hae Kyung Im, Jonathan K. Pritchard
http://davidaknowles.github.io/leafcutter/
Apache License 2.0
201 stars 112 forks source link

prepare_phenotype_table.py #200

Open yangchuhua opened 2 years ago

yangchuhua commented 2 years ago

To Whom It May Concern,

thanks for your effort in the development of leafcutter. It's really a helpful tool.

I used leafcutter_cluster_regtools_py3.py to get intron clustering.

I wonder if the row names of my results such as "chr1:17055:17233:clu1-" caused the error as the following: Parsing chromosome names... Starting... Traceback (most recent call last): File "/raid3/homedirs/yang/software/leafcutter/scripts/prepare_phenotype_table.py", line 198, in <module> main(args[0], get_chromosomes(args[0]), get_blacklist_chromosomes(options.cbl), int(options.npcs) ) File "/raid3/homedirs/yang/software/leafcutter/scripts/prepare_phenotype_table.py", line 126, in main fout[chr_].write("\t".join([chr_,s,e,chrom]+[str(x) for x in valRow])+'\n') KeyError: '1'

The perind.counts.gz file was used to get PCA for FastQTL analysis as the following. image

Best regards, Chuhua

n-hackert commented 2 years ago

Hey Chuhua,

first of all, I wanna say thanks to the leafcutter team for their amazing work and start with a small disclaimer: I have only just started using leafcutter, so my answer might not be the ideal solution to your problem.

However, I ran into the same problem that you described above, when working through the example sQTL workflow. For me the problem could be solved by adding a .replace("chr", "") to the name extraction part of the get_chromosomes function in the prepare_phenotype_table.py script.

def get_chromosomes(ratio_file):
    """Get chromosomes from table. Returns set of chromosome names"""
    try: open(ratio_file)
    except:
        sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
        return
    sys.stderr.write("Parsing chromosome names...\n")
    chromosomes = set()
    with gzip.open(ratio_file, 'rt') as f:
            f.readline()
            for line in f:
                ### this line was modified as described above ###
                chromosomes.add(line.split(":")[0].replace("chr", ""))
    return(chromosomes)

As far as I could see, the script relies on the names of the chromosomes being in the format "1", "2", "3", ... instead of "chr1", "chr2", "chr3", ... which is the format returned by the above function (given the format of the input file of course). Further down the pipeline this causes the key error, because the fout dictionary is initialized with the keys being "chr..." instead of just "...".

for i in chroms:
        fout[i] = file(ratio_file+".phen_chr"+i,'w')
        fout_ave = file(ratio_file+".ave",'w')

The chroms stems from the get_chromosomes function and causes the dictionary keys to be initialized as "chr...".

In another two sections of the script this "chr"-prefix is explicitly discarded to read the correct format from the file:

    for dic in stream_table(gzip.open(ratio_file),' '):
        ### this line explicitly removes the chr in the beginning of each line of the input file ###
        chrom = dic['chrom'].replace("chr", "")
        chr_ = chrom.split(":")[0]
        if chr_ in blacklist_chroms: continue
        NA_indices, valRow, aveReads = [], [], []
        tmpvalRow = []

and

for i in xrange(len(matrix)):
        chrom, s = geneRows[i].split()[:2]
        ### and this pattern appears again in the line below ###
        lst.append((chrom.replace("chr",""), int(s), "\t".join([geneRows[i]] + [str(x) for x in  matrix[i]])+'\n'))

These two sections are led me two believe that this "chr"-removal was simply missing from the get_chromosomes function, and indeed my "fix" seems to work just fine.

I hope this helps you and I did not misinterpret anything from the script or misuse any input files (would be really happy to hear from the leafcutter team).

Cheers, Nico