fomightez / sequencework

programs and scripts, mainly python, for analyses related to nucleic or protein sequences
24 stars 4 forks source link

AttributeError, plot_expression_across_chromosomes.py #1

Closed estyholt closed 5 years ago

estyholt commented 5 years ago

Hi,

I am trying to run the following:

python plot_expression_across_chromosomes.py Homo_sapiens.GRCh38.88.chr.gtf data.tsv --columns 1,9,4 --exp_desig ko --base_desig in

I get:

Reading annotation file and getting data on genes and chromosomes...plot_expression_across_chromosomes.py:569: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'. annotaton_file, header=None, names=col_names_to_apply, comment='#') sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False. Information for 58137 genes parsed...Traceback (most recent call last): File "plot_expression_across_chromosomes.py", line 611, in chromosomes_in_roman_num = not bool(len([s for s in seqname_set if s.isdigit()]) > len(seqname_set)/2) #checks if most chromosomes seem to be digits and says they are roman numerals if that is false AttributeError: 'int' object has no attribute 'isdigit'

Here is the beginning of Homo_sapiens.GRCh38.88.chr.gtf:

!genome-build GRCh38.p10

!genome-version GRCh38

!genome-date 2013-12

!genome-build-accession NCBI:GCA_000001405.25

!genebuild-last-updated 2017-01

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; 1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

Here is the beginning of data.tsv, processed as here:

gene baseMean log2FoldChange FoldChange lfcSE stat pvalue padj rel_fix ENSG00000000003 3466.722148 0.134468968 1.097688699 0.175182735 0.767592582 0.442729271 0.787106136 1 ENSG00000000005 0.355448459 0.273834201 1.209016724 2.440359486 0.112210599 0.910656425 NA 1

Also, running time is quite long..

Thank you

fomightez commented 5 years ago

Sorry, it didn't work right out of the gate.

I suspect you could get around it by replacing the line chromosomes_in_roman_num = not bool(len([s for s in seqname_set if s.isdigit()]) with chromosomes_in_roman_num = True or chromosomes_in_roman_num=False depending if the chromosomes are in roman numeral form or not. (I work mainly with yeast so I don't know the answer off the top of my head.) Of course, if it is true, that may lead to it breaking still. I'll look into more of a long-term solution in a few hours.

fomightez commented 5 years ago

I changed it so that the chromosome designation column gets handled as if it a string even if it was reading in integers for the first column. Hopefully, it will work for you now.

When testing, I used the following command to make the GTF file smaller and then used that for testing:

head -199000 Homo_sapiens.GRCh38.88.chr.gtf > truncHomo_sapiens.GRCh38.88.chr.gtf
python plot_expression_across_chromosomes.py truncHomo_sapiens.GRCh38.88.chr.gtf data.tsv --columns 1,9,4 --exp_desig ko --base_desig in

As for speed, I don't think I can improve much the way the script currently reads input files. Pandas is actually reading that 1.4 Gb file and then the script was encountering an error almost immediately after that. In other words, Pandas .read_csv() method reading the 2.6 million lines of Homo_sapiens.GRCh38.88.chr.gtf is the rate-limiting step there. In theory, if I imposed data types while reading that may speed it up. But GTF files are not always standard, and so I was trying to keep options open. I am going to test some options to see if I can come up with any improvement.

fomightez commented 5 years ago

So I am seeing memory improvement with dtypes. In the code example, below first about a tenth of the human gtf file is read in normally and then with dtypes:

!head -266000 Homo_sapiens.GRCh38.88.chr.gtf > truncHomo_sapiens.GRCh38.88.chr.gtf
import pandas as pd
def mem_usage(pandas_obj):
    '''
    from https://www.dataquest.io/blog/pandas-big-data/
    '''
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
col_names_to_apply = ("seqname", "source", "feature type", "start", 
    "end", "score", "strand", "frame", "attribute")
df = pd.read_csv(
    "truncHomo_sapiens.GRCh38.88.chr.gtf", sep='\t', header=None, low_memory=False,
    names=col_names_to_apply, comment='#') 
print(mem_usage(df))

col_names_to_apply = ("seqname", "source", "feature type", "start", 
    "end", "score", "strand", "frame", "attribute")
column_types = {"seqname":'category',
                "source":'category',
                "feature type":'category',
                "start":'int64',
                "end":'int64',
                "score":'category',
                "strand":'category',
                "frame":'category',
                "attribute":'category',
               }
df = pd.read_csv(
    "truncHomo_sapiens.GRCh38.88.chr.gtf", sep='\t', header=None, low_memory=True,
    names=col_names_to_apply, comment='#',dtype=column_types)
print(mem_usage(df))

Results:

225.55 MB
145.74 MB

However, where I was working at the time, I still couldn't even read in the full GTF. I need to test some more later when I can use a system where I can load the entire GTF and verify the speed improvement is there.

estyholt commented 5 years ago

Thank you, now it's running and producing a plot. The only problem now is that the order of the chromosomes on the x axis is not sequential.

fomightez commented 5 years ago

Yup, I was afraid of that.

However, I like that we made progress. And I am hopeful about this next hiccup since I think I had looked into controlling the order of the x-axis when I was working with roman numerals myself. (You'll note that chromosome number nine [designated by roman numeral IX] indicated is out of order in the current documentation plots.)

I'd definitely like to fix it for the case where chromosomes are designated by integers because that should be easily addressable. Much more so than roman numerals.

Can you tell me how soon it deviates from being sequential? I am going to see if I can reproduce similar output by generating mock data from that 1.4 Mb GTF file because I am trying to avoid asking you to send me some of your data. However, I worry data I generate that way may not recapitulate what you are seeing because I'll be using an essentially ordered list. My address is just my github user name at gmail.com if you had something you felt comfortable sending that produced the non-sequential outcome. Even if it is just the first column of something that produced the issue. I can always fill in mock values.

The other question I have is can you use what you have for now? Or did you need this yesterday and you'd take any kludgy trick I come up with to control the order ?

fomightez commented 5 years ago

Backing up to the speed issue, I was trying on a different system and definitely convinced myself to add dtypes. I couldn't even read in that GTF file on a remote system without them. Turns out it isn't just speed. It just won't work under Amazon Linux. It crashes out without dtypes like discussed here.

fomightez commented 5 years ago

Improvements were made yesterday to address the speed issue but I forgot to tag this issue in the commit. Reading the GTF and then compiling the necessary information have now been improved, see https://github.com/fomightez/sequencework/commit/2111c66fcf08e5849b8da0587416fdd9e0189136 .

After reading the GTF, I was building the dataframe a row at a time for each gene. That was tolerable with a 12 mb GTF from yeast, but not at all when scaled up to the 1.4 gb from human.

Fortunately the early steps of this script and the mock expression generator script were the same, and so I could improve this one while working on getting human data for addressing the order of the chromosomes along the x-axis. So now I have an idea of the problem you may be seeing. More on that soon.

fomightez commented 5 years ago

So I suspect you were seeing something like this? human_but_without_seqname_as_categorical_data_across_chrtemp

Or this? (Note that the X chromosome is after seven and the eleventh is before the tenth.) human_data_across_chr

Fortunately, I stumbled upon the solution to that while addressing a new issue I had noticed after I added the dtypes. (I am going to explain it here for my own record, but feel free to skip to below to see outcome.) I was trying to address the issue that since adding the dtypes, the plotting code that worked with all chromosomes (see 2nd figure posted here) no longer worked with individual chromosomes or subsets of chromosomes. Turns out that was happening because I made all the text-based columns categoricals and subsetting on just the selected chromosomes was working but when I applied groupby it was using the original categoricals as the grouping factors and causing errors when I tried to get last row from empty dataframes. The solution to allow the groupby to work with categorical to only result in groups for those from groups with data was to add observed=True setting to the groupby step. However, they working that out I realized by default most text-based columns just get assigned as objectdtype. Thinking that would be better than involving categoricals, I changed all dtypes to that and that enabled me to leave out observed=True setting to the groupby step. However, the order of the chromosomes along the x-axis that resulted was atrocious. Things were all over the place, like in the first image I posted in this reply. Fortunately, knowing I could deal with the chromosome names as categoricals by usingobserved=True setting for the groupby step to make plotting subsets of chromosomes possible again, I then tried just setting the chromosome name column to a category dtype and leaving all other text-based columns as object dtype. Amazingly, not only did things work again so that I could plot subsets of chromosomes, now the resulting x-axis order matched the original GTF. See example below: human_fins_data_across_chr

There are still some minor label placement issues for the MT and higher-numbered chromosomes but that can be sorted out by saving as SVG and editing in vector graphics software. Alternatively, some of it might be solved by adding a way to make the plot larger. However, I think this issue is closed.

estyholt commented 5 years ago

Thanks for your detailed answers. Chromosomes order is now fixed.

fomightez commented 5 years ago

Thank you for trying it and letting me know. It really helped improve it.

I have made a launchable Jupyter session that will have examples using both the human and yeast GTFs. It can be found here with the static version here. That will make it easier for other folks to use it and for me to test the basics and advanced features more easily.

Unfortunately, in that environment it seems the human section seems to fail due to restrictive computational/memory capacity. (Or at least the human section commands did when I first tried.)